Introduction to Open Data Science - Course Project

Chapter  Topic                                Max points  Book Ch (K.V.)  Book Ch R for HDS  Assignment done?
1        Start me up!                         20          1 & 2           -
2        Regression and model validation      5+15        3 & 4           7
3        Logistic regression                  5+15        5 & 6           7
4        Clustering and classification        15+5        -               12, 17 & 18
5        Dimensionality reduction techniques  5 + 15      13 & 14         -
6        Analysis of longitudinal data        5 + 15      8 & 9           -
Misc     Misc

About the project

Write a short description of the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file, so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 12 22:36:21 2022"

The text continues here.

Hello, testing this!! ZOM!

Hello World, why are my commits not pushing through?

I was feeling happy and motivated! I got R, Rstudio, Git and Github up and running. Now I see my commits on Github but unfortunately my diary is not updating. Apparently, according to Github, the updating of a public page may take up to 24h. I feel like that is a very long time and I wonder if commits made directly on Github (not remotely with Git) would take as long.

I discovered the IODS course in an email sent to all doctoral schools. I’m looking forward to learning about R Markdown, project management in GitHub, and clustering.

Looking forward to this course!

Update: I got to attend our course IT-help-zoom. It was great! Even though my diary did not update despite our efforts. Apparently everything looks good and my diary should update if I wait some more. The info was reassuring so I’ll be waiting more serenely now!

Meanwhile, I’ll keep myself busy with R for Health Data Science as suggested during the course introduction lecture. It seems very informative and easy to read. I’m somewhat familiar with R, but I haven’t used the tidyverse much. Piping still confuses me a bit, but it’s my favorite topic! Exercise Set 1 is interesting and useful. It did require some package installations, but I’m good to go and ready to keep working now that all packages are…packed along!

Update v2: We did it! The diary updated itself after 2 h 13 min of patient waiting (and eager non-stop pushing of new commits).

Link to my GitHub repo: https://github.com/hsanez/IODS-project

Yay! (Image: A happy girl at a laptop, by Marina Vishtak)

Ch2 Regression and model validation

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Mon Dec 12 22:36:21 2022"

Here we go again…

Libraries used for Ch2

Load the needed R packages before getting to work. Note! You might need to install the packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr"))
library(tidyverse)
library(GGally)

Data wrangling

1. Read data (1p.)

I created a new data folder in my IODS-project folder. It includes an R script named ‘create_learning2014.R’, which is the R code for this data wrangling part of Assignment 2. The original data I’m using is the full learning2014 data downloaded from here. The data includes headers and uses tab as a separator. As per the course assignment instructions I have commented the data and my code in ‘create_learning2014.R’.

✓ Done!

2. Create an analysis dataset (1p.)

In this part we created an analysis dataset which contains 7 variables: gender, age, attitude, deep, stra, surf and points. Some of these (attitude, deep, stra, surf, points) were made by combining questions in the learning2014 data, as defined in our Exercise Set and also on the bottom part of this page. The combination variables were scaled and observations where points = 0 were excluded. The data has now 166 observations and 7 variables. See my full comments and code in ‘create_learning2014.R’.

✓ Done!
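A minimal sketch of this combine-scale-filter step, using invented question columns q1 and q2 (not the real survey item names) on toy data:

```r
# Toy data: two hypothetical question columns plus exam points
raw <- data.frame(
  q1     = c(4, 2, 5),
  q2     = c(3, 1, 4),
  points = c(20, 0, 25)
)

# Combine questions by averaging, which keeps the result on the 1-5 Likert scale
raw$deep <- rowMeans(raw[, c("q1", "q2")])

# Exclude observations with zero exam points
analysis <- raw[raw$points > 0, ]
nrow(analysis)  # 2 observations remain
```

The real script combines many more question columns per summary variable, but the pattern (rowMeans over the selected columns, then a filter on points) is the same.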

3. Save new data set (3p.)

I set the working directory of my R session to the IODS-project folder with setwd() and then saved the analysis dataset to the data folder as learning2014.csv with write_csv() (readr package, part of tidyverse). ?write_csv was a good help. With read_csv(), str(), dim() and head() I made sure the data was written correctly. Again, see my full comments and code in ‘create_learning2014.R’.

✓ Done!
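The save-and-verify routine described above can be sketched as a round trip, here with readr, toy data and a temporary file instead of the actual data folder:

```r
library(readr)

# Toy data standing in for the analysis dataset
df <- data.frame(gender = c("F", "M"), points = c(25, 12))

# Write to a throwaway path, read it back, and compare
tmp <- file.path(tempdir(), "roundtrip_check.csv")
write_csv(df, tmp)
df2 <- as.data.frame(read_csv(tmp, show_col_types = FALSE))
all.equal(df, df2, check.attributes = FALSE)  # TRUE if the round trip preserved the data
```

check.attributes = FALSE is used because read_csv attaches column-specification attributes that are irrelevant to the comparison of values.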

Analysis

  • The Analysis exercise focuses on performing and interpreting regression analysis.
  • Include all the code, your interpretations and explanations in the R Markdown file chapter2.Rmd.
  • The output of your Analysis should appear in your course diary (update your local index.html file by knitting the index.Rmd file, which includes chapter2.Rmd as a child file, and then push the changes to GitHub).
  • Write a continuous report with a clear structure.
  • For full points you should show an understanding of the methods and results used in your analysis. Clear, understandable and comprehensive explanations are worth full points.
  • Feel free to also use material outside this course as learning sources.

R Markdown Hint: When you knit the document, the working directory is temporarily set to be the folder where your R Markdown file is located. This is good to be aware of when reading data for example.

REPORT (Assignment 2)

1. Read and describe data (3p.)

First we start by reading the analysis data into R. Make sure all required packages are installed (see the beginning of Ch2). The data we’re using is the one created in the data wrangling part of Assignment 2. The data can also be read directly from this page, where the separator is a comma and headers are included.

#load required packages
library(tidyverse)
library(GGally)
# Read data (Make sure you're in the correct working folder with getwd())
lrn14a <- read_csv("data/learning2014.csv")
# Alternative site for reading the data
#lrn14a <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt")
#Structure of the data
str(lrn14a)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Dimensions and first peek at the data
dim(lrn14a);head(lrn14a)
## [1] 166   7
## # A tibble: 6 × 7
##   gender   age attitude  deep  stra  surf points
##   <chr>  <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 F         53      3.7  3.58  3.38  2.58     25
## 2 M         55      3.1  2.92  2.75  3.17     12
## 3 F         49      2.5  3.5   3.62  2.25     24
## 4 M         53      3.5  3.5   3.12  2.25     10
## 5 M         49      3.7  3.67  3.62  2.83     22
## 6 F         38      3.8  4.75  3.62  2.42     21

The analysis data is a subset of a full dataset called ‘learning2014’. The full data is the result of a survey on attitudes towards studying and statistics, conducted between 3.12.2014 and 1.10.2015 among students of an introductory course in social statistics.

The results are students’ answers to questions on different subtopics of learning, and some of these subtopics are summarized as new variables in our analysis dataset. The subset we created and read, ‘learning2014.csv’, contains 7 variables: gender, age, attitude, deep, stra, surf and points, where points records the student’s exam points.

Some of these variables were made by combining questions in the learning2014 data

  • attitude = summary variable of questions about the student’s attitude towards statistics
  • deep = deep approach = summary variable of questions to assess whether the student is using a deep approach to learning
  • stra = strategic approach = summary variable of questions to assess whether the student is using a strategic approach to learning
  • surf = surface approach = summary variable of questions to assess whether the student is using a surface approach to learning

See the survey questions and combinations made for the above mentioned variables on this page.

The summary variables were scaled back to the original 1-5 Likert scale (by taking means of the combined questions), and observations (students) with 0 exam points were excluded.

The remaining subset of data has 166 observations (students) and 7 variables.

2. Graphical overview and summary (3p.)

We will look at a graphical overview of our data. We have 7 variables, of which gender seems to be a dichotomous variable without missing values, with more female than male students:

summary(lrn14a$gender); unique(lrn14a$gender); table(lrn14a$gender)
##    Length     Class      Mode 
##       166 character character
## [1] "F" "M"
## 
##   F   M 
## 110  56

We’ll use this information to split our graphical output in two (male and female students) and draw summary plots of the remaining variables:

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
# (use the bare column name inside aes(); lrn14a$gender would bypass ggplot's data masking)
p <- ggpairs(lrn14a, mapping = aes(col = gender, alpha = 0.3),
             lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

In this plot matrix we can see all variables split by gender: boxplots and histograms against the gender variable, scatter plots and density plots for the continuous variables, and Pearson correlation coefficients with statistical significance marked with asterisks (see the GGally documentation).

As we can see:

  • There are more female (red) than male (blue) students.
  • Most students are under 30 years old; female students tend to be somewhat younger overall, with a lower median age.
  • Male students have, on average, a more positive attitude towards studying and statistics.
  • Both genders score high on the deep approach to studying, but female students show slightly more strategic and surface approaches to studying than males.

For the Pearson correlation results we can see both overall and gender-specific correlation coefficients.

  • Attitude towards statistics and exam points are positively correlated at a statistically significant level (p<0.001): a higher attitude summary score is associated with higher exam points regardless of gender.
  • Logically, the surface and deep approaches to studying are negatively correlated (p<0.001) among male students, but interestingly not among females.
  • There are also signs of a negative correlation between attitude and the surface approach to studying among males.
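The significance stars in the plot matrix come from tests like cor.test(); a standalone sketch on simulated data (not the course data) shows where the coefficient and p-value come from:

```r
# Simulate two positively related variables
set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

# Pearson correlation with a significance test
ct <- cor.test(x, y, method = "pearson")
ct$estimate  # sample correlation coefficient
ct$p.value   # p-value for H0: true correlation is zero
```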

3. Regression model fitting (4p.)

We’ll proceed to create a regression model. As instructed, we’ll choose three explanatory variables for our model, with exam points as the dependent (outcome) variable. I’ll base my choice on the correlation information above and try to avoid possible collinearity between the chosen variables. Since there are multiple explanatory (independent) variables, we’ll use a multiple regression approach.

Dependent variable:

  • points

Independent variables:

  • attitude (statistically significant correlation with points)
  • stra (large correlation coefficient)
  • surf (large correlation coefficient)
# create a regression model with multiple explanatory variables
reg_model <- lm(points ~ attitude + stra + surf, data = lrn14a)

# print out a summary of the model
summary(reg_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the summary of the model we can see that

  • attitude is positively and statistically significantly (p=1.9E-8) associated with exam points. From the estimate we can infer that a rise of 1 point in the attitude summary score corresponds to a rise of approximately 3.4 exam points when the other explanatory variables remain unchanged.
  • The strategic and surface approaches to studying do not seem to be associated with exam points.

The t values are t statistics that have been calculated as estimate/standard error.
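Using the attitude row of the summary above, this division can be reproduced by hand (small rounding differences versus the printed value are expected, since the displayed estimate and standard error are themselves rounded):

```r
estimate  <- 3.3952   # attitude coefficient from the summary above
std_error <- 0.5741   # its standard error
t_value   <- estimate / std_error
round(t_value, 3)     # close to the printed t value of 5.913
```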

The F-statistic has a very low p-value, meaning there is evidence that at least one of the coefficients of the studied explanatory variables is nonzero.

R-squared tells us about the fit of the model to our data. Since we have used multiple explanatory variables, we look at the adjusted R-squared, which corrects for the inflation of R-squared caused by adding explanatory variables to the model. Here the adjusted R-squared is 0.19, meaning that the three chosen explanatory variables account for approximately 19% of the variation in exam points.
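The adjusted R-squared can be reproduced from the multiple R-squared with the standard formula $1 - (1 - R^2)\frac{n-1}{n-p-1}$, using the n = 166 observations and p = 3 predictors of our model:

```r
n  <- 166      # observations
p  <- 3        # explanatory variables
r2 <- 0.2074   # multiple R-squared from the summary above

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.1927, matching the summary output
```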

Since only attitude seems to be statistically significantly associated with the dependent variable, we’ll rerun the regression model. We’ll keep attitude as an explanatory variable and instead of surf and stra, we’ll test age and gender as explanatory variables.

# create a regression model with attitude, age and gender as explanatory variables
reg_model_extra <- lm(points ~ attitude + age + gender, data = lrn14a)

# print out a summary of the model
summary(reg_model_extra)

4. Results - model simplification (3p.)

Since only attitude seems to be statistically significantly associated with the dependent variable, points, we’ll run the regression model once more, this time only with attitude as an explanatory variable, to fit the model better and get a more parsimonious model. We’ll also draw a scatter plot with a regression line to visualize the result.

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = lrn14a) + geom_smooth(method = "lm")

# create a regression model with a single explanatory variable
reg_model2 <- lm(points ~ attitude, data = lrn14a)

# print out a summary of the model
summary(reg_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The regression line fits the scatter plot quite nicely, with a positive slope and tight 95% confidence intervals. Again, attitude is statistically significantly associated with points (p=4.1E-9): exam points rise by approximately 3.5 when the attitude summary score rises by 1. The estimate is slightly higher and the p-value slightly lower than before. Since we have only one explanatory variable, we can use the multiple R-squared of the model (0.19), which now equals the square of the correlation coefficient between attitude and points. This means that attitude explains almost 20% of the variation in exam points.
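The claim that a one-predictor model’s multiple R-squared equals the squared Pearson correlation can be checked on simulated data:

```r
# Simulated single-predictor regression
set.seed(42)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
fit <- lm(y ~ x)

# Multiple R-squared vs squared Pearson correlation
r_squared <- summary(fit)$r.squared
all.equal(r_squared, cor(x, y)^2)  # TRUE (up to floating-point tolerance)
```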

5. Diagnostic plots (3p.)

We have fitted a regression model to our data and interpreted the results. Now we will assess the model to check whether it complies with our statistical assumptions and whether it is a sensible model to use on our data.

Graphical model validation with plot():

  • Residuals vs Fitted values (1)
  • Normal QQ-plot (2)
  • Residuals vs Leverage (5)

# par() could be used to show plots in a matrix
# par(mfrow = c(2,2))
# draw diagnostic plots using the plot() function.
plot(reg_model2, which=c(1,2,5))

Linear regression modeling has four main assumptions (quoted from R for Health Data Science):

  1. Linear relationship between predictors and outcome
  2. Independence of residuals
  3. Normal distribution of residuals
  4. Equal variance of residuals

  • In the ‘Residuals vs Fitted’ plot the variance of the residuals seems to stay roughly constant across the fitted values (assumption 4 ✓).
  • In the Normal QQ plot the residuals look quite normally distributed; some points curve away from the reference line at both ends, which means there are some extreme values in the data, but the distribution seems normal (assumption 3 ✓).
  • From the ‘Residuals vs Leverage’ plot we can estimate whether some observations affect the model significantly more than others (outliers). As in the QQ plot, we see some extreme values, but none of them cross the Cook’s distance threshold, so we can conclude that there are no critical outliers affecting our model.
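The influence check can also be done numerically with cooks.distance(), which underlies the ‘Residuals vs Leverage’ plot. A sketch on simulated data with one deliberately influential point (a common rule of thumb flags D > 4/n):

```r
# Simulated data following y = x, plus one far-off, high-leverage point
set.seed(7)
x <- c(rnorm(40), 5)
y <- c(x[1:40] + rnorm(40), -10)
fit <- lm(y ~ x)

d <- cooks.distance(fit)
which(d > 4 / length(d))  # indices of potentially influential observations
```

In our actual model none of the points exceed the plotted Cook’s distance contours, so no observation would need this kind of follow-up.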

List of diagnostic plots for plot() and which():

which  graphic
1      Residuals vs Fitted values
2      Normal QQ-plot
3      Standardized residuals vs Fitted values
4      Cook’s distances
5      Residuals vs Leverage
6      Cook’s distance vs Leverage

✓ ✓ ✓ ✓ ✓ Done!

Ch3 Logistic regression and cross-validation

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Mon Dec 12 22:36:35 2022"

Here we go yet again…

Libraries used for Ch3

Note! You might need to install some packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr"))

Load the needed R packages before getting to work:

#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)

Data wrangling

See create_alc.R on my Github repository.

Data analysis

REPORT (Assignment 3)

The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in our data.

2. Read and describe the data

Read the analysis data.

# Read data (Make sure you're in the correct working folder with getwd())
alc <- read_csv("data/student_alc.csv")
# Alternative site for reading the data
#alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv")

Describe the data.

# print variable names and dimension of the tibble
colnames(alc); dim(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
## [1] 370  35

The analysis dataset we are using for this course assignment consists of the above listed 35 variables and 370 observations.

Our analysis dataset is a subset of the original ‘Student Performance Data Set’ by Prof. Cortez (University of Minho, Portugal), collected from two Portuguese schools during the school year 2005-2006. The original dataset was collected using questionnaires and school reports.

The analysis dataset was created by joining two datasets from the original data: student records from two school subjects, Mathematics and Portuguese language. These subject-specific datasets were merged by their common identifying variables, and only students present in both subject datasets were included in the resulting analysis dataset. Naturally, there was between-subject variation in the following course-related variables:

  • failures = number of past class failures (numeric: n if 1<=n<3, else 4)
  • paid = extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • absences = number of school absences (numeric: from 0 to 93)
  • G1 = first period grade (numeric: from 0 to 20)
  • G2 = second period grade (numeric: from 0 to 20)
  • G3 = final grade (numeric: from 0 to 20, output target)

For the numeric variables we calculated student-specific averages using both of the datasets (Mathematics and Portuguese language). For the binary variable (paid) we chose the value from the Mathematics dataset.
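A hypothetical miniature of this join-and-average logic (the id and column names here are invented for illustration; the real wrangling joins on several background variables):

```r
library(dplyr)

# Two toy subject datasets sharing an invented student id
math <- data.frame(id = 1:2, absences = c(4, 10), paid = c("yes", "no"))
por  <- data.frame(id = 1:2, absences = c(6, 2),  paid = c("no",  "no"))

# Keep only students present in both datasets, then resolve the duplicates
joined <- inner_join(math, por, by = "id", suffix = c("_mat", "_por")) %>%
  mutate(
    absences = (absences_mat + absences_por) / 2,  # average the numeric duplicates
    paid     = paid_mat                            # binary variable: take the Math answer
  )
joined$absences  # 5 and 6
```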

See the R code for the exact course of our data wrangling.

You can find the variable descriptions and more information about (and download) the original dataset on this page.

3. Hypotheses on the relationships between alcohol consumption and chosen variables

My chosen variables:

  • age = student’s age (numeric: from 15 to 22)
  • sex = student’s sex (binary: ‘F’ - female or ‘M’ - male)
  • famrel = quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • absences = number of school absences (numeric: from 0 to 93)

The original dataset includes information on students from 15 to 22 years of age. This means that not all students are of legal drinking age, which seems to be 18 in Portugal. Students under 18 can’t buy alcohol themselves, which could show in the relationship between age and alcohol use: there should be less alcohol use among younger students.

On average, men are larger than women and their body composition differs, which makes their physiology, and thus alcohol metabolism, different from women’s. This typically results in a higher tolerance for alcohol among men, which may lead them to drink more before feeling the effects of alcohol. For this reason I think there might be a relationship between sex and alcohol use.

Students might spend a lot of time at home, both studying and during their free time. Thus, bad family relations could affect studying possibilities, motivation and recovery from studying. One reason for bad family relationships may be alcohol itself, perhaps overused by the student themselves or by other family members.

Students who drink more alcohol might have more absences from school. Since weekday alcohol consumption is measured, it could show up as absences during the school week. Friday or Saturday use might not show up as absences in the data, assuming that these Portuguese schools have a regular Monday-to-Friday school week.

My hypotheses:

  • More alcohol use among older students.
  • More alcohol use among male students.
  • More alcohol use among students with worse family relationships.
  • More alcohol use among students with more school absences.

4. Descriptive analyses

# Load needed libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(GGally)

# glimpse at the data
glimpse(alc); dim(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## [1] 370  35
# create a vector of the chosen variables + alcohol use
myvars <- c('age', 'sex', 'famrel', 'absences', 'alc_use', 'high_use')

# pivot data and draw a bar plot of the chosen variables
gather(alc[myvars]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

# draw summary table grouped by sex and low/high use of alcohol consumption
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = round(mean(age),1), mean_family_relation_quality = round(mean(famrel),1), mean_absences = round(mean(absences),1))
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count mean_age mean_family_relation_quality mean_absences
##   <chr> <lgl>    <int>    <dbl>                        <dbl>         <dbl>
## 1 F     FALSE      154     16.6                          3.9           4.3
## 2 F     TRUE        41     16.5                          3.7           6.9
## 3 M     FALSE      105     16.3                          4.1           2.9
## 4 M     TRUE        70     16.9                          3.8           6.1
# draw summary plot matrix stratified for sex as an overview of the data and correlations
p <- ggpairs(alc[myvars], mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

  • Absences:
    • Most students have only a few (0-4) absences, while some have very many (20-44).
    • There is also a group of students with an average number of absences (median = 12).
    • Female students with high alcohol use have the most absences.
    • Male students without high alcohol use have the fewest absences.
    • There seems to be a statistically significant positive correlation between alcohol use and absences (r = 0.123) regardless of sex.
  • Age:
    • Students are between 15 and 22, most of them under 19.
    • The correlation analysis implies that there is a statistically significant positive correlation between age and alcohol use among male students.
  • Family relations:
    • Most students seem to have good (4-5 on Likert scale) family relations.
    • Male students without high alcohol use seem to have the best family relations, and family relation quality seems to correlate statistically significantly with alcohol use.
  • Alcohol use:
    • There are more students who don’t use high amounts of alcohol (n=259).
    • Most of students who don’t use high amounts of alcohol are female students (n=154).
    • There are more male students who use high amounts of alcohol compared to female students (n=41).
  • Sex:
    • There are more female (n=195) than male students (n=175).
  • My hypotheses:
    • More alcohol use among older students. –> True among male students!
    • More alcohol use among male students. –> True!
    • More alcohol use among students with worse family relationships. –> True among male students!
    • More alcohol use among students with more school absences. –> True!

Nicely hypothesized - or - as Jorma Uotinen might say:

5. Regression analyses

Our outcome variable is dichotomous (binary), so we will assess its relationship with our chosen variables using a logistic regression model.

Dependent variable:

  • High/low alcohol use (high_use) (dichotomous)

Independent variables:

  • age (numerical)
  • sex (dichotomous)
  • quality of family relations (famrel) (ordinal)
  • number of school absences (absences) (numerical)
library(dplyr); library(ggplot2)

# First, we define our logistic regression model m.
# We'll use glm(), a generalized linear model function, so we specify the model
# type as logistic with family = "binomial", which makes glm() use the 'logit' link.
m <- glm(high_use ~ age + sex + famrel + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + famrel + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2151  -0.8371  -0.5950   1.0577   2.1545  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.44606    1.75957  -1.958 0.050175 .  
## age          0.17148    0.10306   1.664 0.096143 .  
## sexM         1.09171    0.24864   4.391 1.13e-05 ***
## famrel      -0.32224    0.12913  -2.495 0.012579 *  
## absences     0.08915    0.02298   3.879 0.000105 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 407.03  on 365  degrees of freedom
## AIC: 417.03
## 
## Number of Fisher Scoring iterations: 4
# Compute odds ratios (OR) and their confidence intervals (CI) for each variable
OR <- coef(m) %>% exp %>% round(2)
CI <- confint(m) %>% exp %>% round(2)
# print ORs and CIs as a table
cbind(OR, CI)
##               OR 2.5 % 97.5 %
## (Intercept) 0.03  0.00   0.98
## age         1.19  0.97   1.46
## sexM        2.98  1.84   4.89
## famrel      0.72  0.56   0.93
## absences    1.09  1.05   1.15
  • Sex, family relations and absences seem to have statistically significant associations with high alcohol use (p<0.05 for famrel; p<0.001 for sex and absences).

  • The odds ratio for a predictor is \(e^{\beta}\), i.e. the exponentiated model coefficient (log-odds).

    • Male students have 2.98 times the odds of using high amounts of alcohol compared to female students. The 95% confidence interval is 1.8–4.9 (entirely above 1), which implies a statistically significant association.
    • Good family relations seem to be associated with less alcohol use: the odds ratio is 0.72 (CI 0.6–0.9, both limits below 1), implying that each one-point improvement in family relations is statistically significantly associated with a 28% (1 − 0.72) reduction in the odds of high alcohol use.
    • Absences are statistically significantly associated with high alcohol use: each additional school absence is associated with a 9% increase in the odds of high use.
  • Age has no statistically significant association with high level of alcohol use.

  • My hypotheses:

    • More alcohol use among older students. –> WRONG! (It appeared true among male students in the graphical exploration, but age was not significant in the regression.)
    • More alcohol use among male students. –> True!
    • More alcohol use among students with worse family relationships. –> True among male students!
    • More alcohol use among students with more school absences. –> True!
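
As a sanity check on the odds ratios above, a single coefficient can be exponentiated by hand, e.g. the sexM estimate from the model summary:

\[ \mathrm{OR}_{\text{sexM}} = e^{\hat{\beta}_{\text{sexM}}} = e^{1.09171} \approx 2.98 \]

which matches the OR column computed with exp().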

In summary, three of my original hypotheses remain valid after our correlation and regression analyses.

One hypothesis down! Oh no! But no worries…

6. Predictive power of the model

To assess the predictive power of our model, we will create a ‘confusion matrix’ and some graphs (bar plot, scatter plot and ROC curve), and compute the training error of the model. Finally, we’ll compare the performance of the model with a simple guessing strategy. Later, in part 7 of this report, we’ll test the performance of the model with cross-validation.

As instructed, we’ll use only the variables that showed statistical significance: sex, famrel, absences. Age had no statistically significant association with alcohol use level based on our regression model.

# predict() the probability of high_use
# We're using our model 'm' as the object of the predict()
# type = 'response' requests predictions on the scale of the response variable instead of the default link (logit) scale.
# For a binomial target variable (e.g. our 'high_use') this means predicted probabilities instead of log-odds.
probabilities <- predict(m, type = "response")

library(dplyr)
# add the predicted probabilities to our dataframe 'alc' in a column named 'probability'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
# Let's set our prediction threshold: If the probability is more than 50%, prediction = TRUE, otherwise it's FALSE.
alc <- mutate(alc, prediction = probabilities > 0.5)

# check the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, famrel, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
##    sex   famrel absences high_use probability prediction
##    <chr>  <dbl>    <dbl> <lgl>          <dbl> <lgl>     
##  1 M          4        3 FALSE          0.387 FALSE     
##  2 M          4        0 FALSE          0.405 FALSE     
##  3 M          5        7 TRUE           0.437 FALSE     
##  4 F          5        1 FALSE          0.132 FALSE     
##  5 F          4        6 FALSE          0.247 FALSE     
##  6 F          5        2 FALSE          0.165 FALSE     
##  7 F          4        2 FALSE          0.187 FALSE     
##  8 F          1        3 FALSE          0.398 FALSE     
##  9 M          2        4 TRUE           0.568 TRUE      
## 10 M          4        2 TRUE           0.406 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins()
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   247   12 259
##    TRUE     76   35 111
##    Sum     323   47 370
# transform to proportions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins() %>% round(2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.67 0.03 0.70
##    TRUE   0.21 0.09 0.30
##    Sum    0.87 0.13 1.00

Almost one third of students (30%) drink an amount of alcohol classifiable as ‘high use’, but our model underestimates this by predicting only 13% of students into that class.

We can visualize our observations and predictions with a bar plot:

# initialize a plot of 'prediction' (predicted outcomes)
#basic colors
#p2 <- ggplot(data = alc, aes(x=prediction, col=prediction, fill=prediction))
# Custom colors
pred_colors <- c("#FFA500","#00FF00")
p2 <- ggplot(data = alc, aes(x=prediction, fill=prediction)) +
  scale_fill_manual(values=pred_colors)

# draw a bar plot of high_use by 'high use' (observations)
# delineate outcome variable labels
## define new labels
outlab <- c('True negative','True positive')
## Define old labels in same order
names(outlab) <- c('FALSE','TRUE')
p2 + geom_bar() + facet_wrap("high_use", labeller=labeller(high_use=outlab))

There are more FALSE observations (labelled ‘True negative’ = no high use of alcohol, left panel) than TRUE observations (‘True positive’ = high use of alcohol, right panel). The model predicts the negative class clearly better than the positive class: most students in the left panel get a correct FALSE prediction, while many high-use students in the right panel are wrongly predicted FALSE.

This can be presented also with a sleeker scatter plot:

# initialize a plot of 'high_use' versus 'probability' in 'alc'
p3 <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# define the geom as points and draw the plot
p3 + geom_point()


We can also visualize the performance behind our confusion matrix (sensitivity and specificity) with a ROC (receiver operating characteristic) curve, which shows the true positive rate against the false positive rate and helps us assess how well our logistic regression model fits the data.

  • Sensitivity = the probability that the model correctly predicts a positive outcome (the true positive rate, TPR)
  • Specificity = the probability that the model correctly predicts a negative outcome (= 1 − FPR)

The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR).

  • TPR = TP/(TP+FN)
  • FPR = FP/(TN+FP)

The closer the curve comes to the upper left corner of the plot, the more accurate the test. The closer it stays to the grey 45° diagonal, the less accurate the test.
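
As a quick illustration (my own aside, not part of the exercise), the single TPR/FPR point that our 0.5 threshold picks can be computed by hand from the confusion-matrix counts above:

```r
# Confusion-matrix counts from the table above (threshold = 0.5)
TN <- 247; FP <- 12   # observed high_use = FALSE
FN <- 76;  TP <- 35   # observed high_use = TRUE

TPR <- TP / (TP + FN)   # sensitivity: 35/111
FPR <- FP / (TN + FP)   # 1 - specificity: 12/259

round(c(TPR = TPR, FPR = FPR), 3)
```

The model finds only about 32% of the true high users (TPR ≈ 0.32) while raising few false alarms (FPR ≈ 0.05) — the same imbalance we saw in the bar plot.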

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Define the curve with function roc(response, predictor), see ?roc() and pROC documentation (Resources)
roc_obj <- roc(alc$high_use, alc$probability)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
# Draw the plot
plot(roc_obj)

We can see that the curve clearly departs from the 45° diagonal, which means that our model seems to have at least some predictive value.

Training error

To calculate the training error (= the total proportion of inaccurately classified individuals) of our model, we’ll use the loss function (mean prediction error) presented in our course exercises, loss_func(response, probability). It takes a binary classification variable (‘high_use’) and each individual’s predicted probability of being TRUE.

  • class is either TRUE (1) or FALSE (0) for our variable of interest (‘high_use’)
  • prob (float between 0 and 1) is the individual probability that they are classified as TRUE for ‘high_use’

abs() gives the absolute value of the difference between class and prob; if that difference is > 0.5, the prediction was wrong.

The function thus assigns TRUE or FALSE to each observation depending on whether the prediction was wrong or right. NB: here TRUE (= 1) means the prediction was wrong! The mean of these logicals is the mean prediction error: the number of false predictions divided by the total number of observations.

\[ \frac{1 \cdot n(FP + FN) + 0 \cdot n(TP + TN)}{N} \]

The function:

# define the loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  # print(n_wrong) #print this if you want to see the logicals the final result mean() is based on
  mean(n_wrong)
}


# call loss_func() to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378

A simple guessing strategy might be to guess that no student ever drinks enough to be classified as ‘high use’ (high_use = FALSE, probability(high_use=TRUE) = 0), that all students do (probability(high_use=TRUE) = 1), or that every other student does (probability(high_use=TRUE) = 0.5). These strategies can be tested with loss_func() by supplying a constant individual probability (prob) as described above.

# No students' drinking habits are 'high use', high_use=FALSE, probability(high_use=TRUE)= 0
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# All students' drinking habits are 'high use', high_use=TRUE, probability(high_use=TRUE)= 1
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
# Every other student's drinking habits are 'high use', probability(high_use=TRUE) ≈ 0.5 (exactly 0.5 sits on the threshold, so we use 0.49 and 0.51)
loss_func(class = alc$high_use, prob = 0.49)
## [1] 0.3
loss_func(class = alc$high_use, prob = 0.51)
## [1] 0.7
# Every third student's drinking habits are 'high use', probability(high_use=TRUE) = 0.3
loss_func(class = alc$high_use, prob = 0.3)
## [1] 0.3

The training error is 23.8%, which means that the model’s accuracy (1 − training error) is approximately 76%. None of the simple guesses about how many students’ drinking habits are ‘high use’ reached a higher accuracy:

Guess classified as ‘high use’   Guessed probability of ‘high use’   Training error   Accuracy
No one drinks                    0                                   0.3              70%
Everybody drinks                 1                                   0.7              30%
Every other drinks*              0.49 & 0.51                         0.3 & 0.7        70% & 30%
Every third drinks               0.3                                 0.3              70%

OBS! *Guessing exactly every other (prob = 0.50) can’t be assessed because the loss function uses a strict 0.5 threshold, so the close approximations 0.49 and 0.51 were used instead.

In summary, our model seems to be better at predicting the students’ level of alcohol use than simple guessing.

7. Bonus: 10-fold cross-validation

Quoted from our course material: Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for finding the model. Low value = good. Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2378378
# K-fold cross-validation
library(boot)
cvK10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cvK10$delta[1]
## [1] 0.2648649

The mean prediction error combined from the cross-validation rounds on the testing data (0.26) is slightly higher than on the training data (0.24), as expected for unseen data.

The model introduced in our Exercise set 3 had about 0.25 mean prediction error on the 10-fold cross validation on the testing data which implies that our new model (variables: sex, famrel, absences) has a similar test set performance to the one used in the Exercise set (variables: sex, failures, absences).

8. Super-Bonus: Comparison of logistic regression models with different sets of predictors

We’ll compare 11 different testing models, reducing the number of predictors step by step while always keeping the most significant variables (first dropping the five least significant variables per step, later excluding only one variable per model).

Defining the models (logistic regression) and assigning predictions based on prediction probabilities

# check all possible variables
# -> 31 possible independent variables (we'll omit Daily and Weekend alcohol use since they were used for composing our outcome variable!)
colnames(alc)

# models and summaries
# detecting the least statistically significant variables
m1 <- glm(high_use ~  school + sex + age + address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup + famsup + activities + nursery + higher + internet + romantic + famrel + freetime + goout + health + failures + paid + absences + G1 + G2 + G3 , data = alc, family ="binomial")
summary(m1)

piis <- summary(m1)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + famrel + absences + reason + sex + guardian + address + activities + freetime + Mjob + health + paid + famsize + romantic + Fjob + studytime + traveltime + nursery + failures + G1 + school + Fedu + G2 + age + Pstatus + G3 + higher + internet + famsup + Medu + schoolsup

m2 <- glm(high_use ~  goout + famrel + absences + reason + sex + guardian + address + activities + freetime + Mjob + health + paid + famsize + romantic + Fjob + studytime + traveltime + nursery + failures + G1 + school + Fedu + G2 + age + Pstatus + G3, data = alc, family ="binomial")
summary(m2)

piis <- summary(m2)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + reason + sex + guardian + address + activities + freetime + Mjob + paid + health + famsize + studytime + traveltime + nursery + romantic + failures + G1 + school + Fjob + Fedu + G2 + age + Pstatus + G3

m3 <- glm(high_use ~  goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob + traveltime + nursery + school + failures + G1, data = alc, family ="binomial")
summary(m3)

piis <- summary(m3)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob + traveltime + nursery + school + failures + G1


m4 <- glm(high_use ~  goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob, data = alc, family ="binomial")
summary(m4)

piis <- summary(m4)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + sex + reason + address + guardian + health + Mjob + studytime + activities + paid + famsize + romantic + Fjob + freetime


m5 <- glm(high_use ~  goout + absences + famrel + sex + reason + address + guardian + health + Mjob + studytime + activities, data = alc, family ="binomial")
summary(m5)

piis <- summary(m5)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + sex + reason + guardian + address + activities + studytime + health + Mjob


m6 <- glm(high_use ~  goout + absences + famrel + sex + reason + guardian, data = alc, family ="binomial")
summary(m6)

piis <- summary(m6)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences + famrel + reason + guardian


m7 <- glm(high_use ~  goout + sex + absences + famrel + reason, data = alc, family ="binomial")
summary(m7)

piis <- summary(m7)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences + famrel + reason


m8 <- glm(high_use ~  goout + sex + absences + famrel, data = alc, family ="binomial")
summary(m8)

piis <- summary(m8)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences + famrel


m9 <- glm(high_use ~  goout + sex + absences, data = alc, family ="binomial")
summary(m9)

piis <- summary(m9)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences


m10 <- glm(high_use ~  goout + sex, data = alc, family ="binomial")
summary(m10)

piis <- summary(m10)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex 


m11 <- glm(high_use ~  goout, data = alc, family ="binomial")
summary(m11)



# predict() the probability of high_use
probabilities1 <- predict(m1, type = "response")
probabilities2 <- predict(m2, type = "response")
probabilities3 <- predict(m3, type = "response")
probabilities4 <- predict(m4, type = "response")
probabilities5 <- predict(m5, type = "response")
probabilities6 <- predict(m6, type = "response")
probabilities7 <- predict(m7, type = "response")
probabilities8 <- predict(m8, type = "response")
probabilities9 <- predict(m9, type = "response")
probabilities10 <- predict(m10, type = "response")
probabilities11 <- predict(m11, type = "response")

# add the predicted probabilities to our dataframe 'alc' in a column named 'probability'
alc <- mutate(alc, probability1 = probabilities1)
alc <- mutate(alc, probability2 = probabilities2)
alc <- mutate(alc, probability3 = probabilities3)
alc <- mutate(alc, probability4 = probabilities4)
alc <- mutate(alc, probability5 = probabilities5)
alc <- mutate(alc, probability6 = probabilities6)
alc <- mutate(alc, probability7 = probabilities7)
alc <- mutate(alc, probability8 = probabilities8)
alc <- mutate(alc, probability9 = probabilities9)
alc <- mutate(alc, probability10 = probabilities10)
alc <- mutate(alc, probability11 = probabilities11)

# use the probabilities to make a prediction of high_use
# Let's set our prediction threshold: If the probability is more than 50%, prediction = TRUE, otherwise it's FALSE.
alc <- mutate(alc, prediction1 = probabilities1 >0.5)
alc <- mutate(alc, prediction2 = probabilities2 >0.5)
alc <- mutate(alc, prediction3 = probabilities3 >0.5)
alc <- mutate(alc, prediction4 = probabilities4 >0.5)
alc <- mutate(alc, prediction5 = probabilities5 >0.5)
alc <- mutate(alc, prediction6 = probabilities6 >0.5)
alc <- mutate(alc, prediction7 = probabilities7 >0.5)
alc <- mutate(alc, prediction8 = probabilities8 >0.5)
alc <- mutate(alc, prediction9 = probabilities9 >0.5)
alc <- mutate(alc, prediction10 = probabilities10 >0.5)
alc <- mutate(alc, prediction11 = probabilities11 >0.5)

The ranking of the variables (by p-value) changed between regressions. However, the ‘top’ variables stayed significant and kept their top positions in the regression summaries throughout. The less significant variables changed rankings often, and thus they were the ones primarily discarded from the model.

Nota bene!

I didn’t get my for-loops or dynamic functions to work, so the part 8 scripts have been and will continue being ugly…sorry! But don’t worry, since…

Kittycat suffering from ugly-scriptitis
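
For future reference, the repetitive fitting of part 8 could in principle be collapsed into a loop over formulas. This is a hypothetical, untested sketch (not the code behind the results above), assuming the ‘alc’ data and loss_func() from the earlier sections:

```r
library(boot)

# Fit, score and cross-validate a set of nested models in one loop
formulas <- list(
  high_use ~ goout + sex + absences + famrel,
  high_use ~ goout + sex + absences,
  high_use ~ goout + sex,
  high_use ~ goout
)

results <- lapply(formulas, function(f) {
  m <- glm(f, data = alc, family = "binomial")
  data.frame(
    n_predictors = length(attr(terms(f), "term.labels")),
    train_error  = loss_func(alc$high_use, predict(m, type = "response")),
    test_error   = cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1]
  )
})
do.call(rbind, results)
```

The same pattern would extend to all 11 models by adding their formulas to the list.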

Training and testing errors by model

# Models numbered by amount of variables used
modelss <- c(31,26,21,16,11,6,5,4,3,2,1)

# call loss_func() to compute the average number of wrong predictions in the (training) data
pred_train <- loss_func(class = alc$high_use, prob = alc$probability1)
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability2))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability3))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability4))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability5))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability6))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability7))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability8))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability9))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability10))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability11))

#10-fold cross validation on testing data
cvK10_ss1 <- cv.glm(data = alc, cost = loss_func, glmfit = m1, K = 10)
cvK10_ss2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cvK10_ss3 <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cvK10_ss4 <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cvK10_ss5 <- cv.glm(data = alc, cost = loss_func, glmfit = m5, K = 10)
cvK10_ss6 <- cv.glm(data = alc, cost = loss_func, glmfit = m6, K = 10)
cvK10_ss7 <- cv.glm(data = alc, cost = loss_func, glmfit = m7, K = 10)
cvK10_ss8 <- cv.glm(data = alc, cost = loss_func, glmfit = m8, K = 10)
cvK10_ss9 <- cv.glm(data = alc, cost = loss_func, glmfit = m9, K = 10)
cvK10_ss10 <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = 10)
cvK10_ss11 <- cv.glm(data = alc, cost = loss_func, glmfit = m11, K = 10)


# average number of wrong predictions in the cross validation
cvK10$delta[1]
## [1] 0.2648649
cvK10_ss <- cvK10_ss1$delta[1]
cvK10_ss <- append(cvK10_ss,cvK10_ss2$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss3$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss4$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss5$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss6$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss7$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss8$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss9$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss10$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss11$delta[1])


# create a tibble
supbon <- tibble(modelss, pred_train, cvK10_ss)

Trends of both training and testing errors by the number of predictors in the model

# Pivot data (pivot_longer() comes from the tidyr package)
library(tidyr)
longdf <- pivot_longer(supbon, c('pred_train','cvK10_ss'))

# initialize and draw a plot
ggplot(data=longdf, aes(x=modelss, y=value, col=name)) + geom_point() + geom_line() +
  # Edit legend title and labels
  scale_color_discrete(name = "Mean prediction error", labels = c("testing", "training")) +
  # Edit label names and title
  labs(x="Model", title='Trends of training and testing errors by the number of predictors in the model')

The mean prediction error is:

  • always lower in the training set.
  • very high in both the testing and training set when the model has only 1-2 predictors.
  • quite low in both the testing and training set when the model has 3-5 predictors.
  • quite high in both sets when there are 6-16 predictors in the model.
  • low in the training set and very high in the testing set when there are 21-31 predictors in the model.

\(\rightarrow\) The predictive logistic regression model performs best with neither too few nor too many variables \(\rightarrow\) 3-5 variables seems like a good choice for model construction.

✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Done!

Ch4 Clustering and classification

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Mon Dec 12 22:36:47 2022"

Here we go again…

Libraries used for Ch4

Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot"))

Load the needed R packages before getting to work:

#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)
library(MASS)
library(corrplot)

Data analysis (15p.)

2. Loading and describing the data ‘Boston’

# load the 'Boston' data
data("Boston")

# explore the dataset
str(Boston);dim(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14

The ‘Boston’ dataset from the MASS library is a data frame of Housing Values in Suburbs of Boston. It has 506 observations (rows) and 14 variables (columns).

The variables are:

  • crim = per capita crime rate by town.
  • zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus = proportion of non-retail business acres per town.
  • chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox = nitrogen oxides concentration (parts per 10 million).
  • rm = average number of rooms per dwelling.
  • age = proportion of owner-occupied units built prior to 1940.
  • dis = weighted mean of distances to five Boston employment centres.
  • rad = index of accessibility to radial highways.
  • tax = full-value property-tax rate per $10,000.
  • ptratio = pupil-teacher ratio by town.
  • black = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
  • lstat = lower status of the population (percent).
  • medv = median value of owner-occupied homes in $1000s.

The R documentation of the dataset can be found here.

3. A graphical overview and summary of the data

# summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# data from wide to long
long_Boston <- pivot_longer(Boston, cols=everything(), names_to = 'variable', values_to = 'value')

# Histograms of all variables in use
p1 <- ggplot(data=long_Boston)
p1 + geom_histogram(mapping= aes(x=value)) +
  facet_wrap(~variable, scales="free")

# Summary plot matrix with correlation and density plots
p2 <- ggpairs(Boston, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2

All variables of the dataset ‘Boston’ are numeric. Most of the variables are skewed, so we’ll use Spearman’s correlation for correlation assessment later. There are multiple variables with possible outliers. The correlations are difficult to read from this plot with so many variables, and ggpairs() shows Pearson’s correlations by default, so we’ll compute a Spearman correlation matrix with cor() and visualize it with the ‘corrplot’ library, focusing on the correlations between variables.

# calculate the correlation matrix
cor_matrix <- cor(Boston, method='spearman') 

# print the correlation matrix
cor_matrix %>% round(1)
##         crim   zn indus chas  nox   rm  age  dis  rad  tax ptratio black lstat
## crim     1.0 -0.6   0.7  0.0  0.8 -0.3  0.7 -0.7  0.7  0.7     0.5  -0.4   0.6
## zn      -0.6  1.0  -0.6  0.0 -0.6  0.4 -0.5  0.6 -0.3 -0.4    -0.4   0.2  -0.5
## indus    0.7 -0.6   1.0  0.1  0.8 -0.4  0.7 -0.8  0.5  0.7     0.4  -0.3   0.6
## chas     0.0  0.0   0.1  1.0  0.1  0.1  0.1 -0.1  0.0  0.0    -0.1   0.0  -0.1
## nox      0.8 -0.6   0.8  0.1  1.0 -0.3  0.8 -0.9  0.6  0.6     0.4  -0.3   0.6
## rm      -0.3  0.4  -0.4  0.1 -0.3  1.0 -0.3  0.3 -0.1 -0.3    -0.3   0.1  -0.6
## age      0.7 -0.5   0.7  0.1  0.8 -0.3  1.0 -0.8  0.4  0.5     0.4  -0.2   0.7
## dis     -0.7  0.6  -0.8 -0.1 -0.9  0.3 -0.8  1.0 -0.5 -0.6    -0.3   0.2  -0.6
## rad      0.7 -0.3   0.5  0.0  0.6 -0.1  0.4 -0.5  1.0  0.7     0.3  -0.3   0.4
## tax      0.7 -0.4   0.7  0.0  0.6 -0.3  0.5 -0.6  0.7  1.0     0.5  -0.3   0.5
## ptratio  0.5 -0.4   0.4 -0.1  0.4 -0.3  0.4 -0.3  0.3  0.5     1.0  -0.1   0.5
## black   -0.4  0.2  -0.3  0.0 -0.3  0.1 -0.2  0.2 -0.3 -0.3    -0.1   1.0  -0.2
## lstat    0.6 -0.5   0.6 -0.1  0.6 -0.6  0.7 -0.6  0.4  0.5     0.5  -0.2   1.0
## medv    -0.6  0.4  -0.6  0.1 -0.6  0.6 -0.5  0.4 -0.3 -0.6    -0.6   0.2  -0.9
##         medv
## crim    -0.6
## zn       0.4
## indus   -0.6
## chas     0.1
## nox     -0.6
## rm       0.6
## age     -0.5
## dis      0.4
## rad     -0.3
## tax     -0.6
## ptratio -0.6
## black    0.2
## lstat   -0.9
## medv     1.0
# visualize the correlation matrix
# cl.pos= position of the color legend, e.g. cl.pos = 'b'
# tl.pos= position of the text labels
# tl.cex = size of text labels
# type = type of plot, 'full', 'upper', 'lower'
# method = the visualization of the correlation coefficients, 'circle', 'square', 'pie', 'number', etc.
# Basic one-style correlation plot: corrplot(cor_matrix, method="ellipse", tl.pos ="d", tl.cex=0.6)
# Cool mixed correlation plot:
corrplot.mixed(cor_matrix, lower = 'number', upper = 'ellipse',tl.pos ="lt", tl.col='black', tl.cex=0.6, number.cex = 0.5)

The darker the color (see color legend, either blue or red) and the more straight-line like ellipse, the larger the correlation coefficient between variables. The exact coefficient can be seen on the lower part of the plot. The slope and the color (blue vs. red) of the ellipses depict the direction (positive vs negative) of the correlation. Unfortunately, with cor() we don’t have the p-values for the correlations.
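Base R’s cor() returns only the coefficients, but a p-value for any single pair of variables can be obtained with cor.test(). A minimal sketch for one example pair (nox vs. dis; exact = FALSE is used because Spearman’s exact p-value computation warns when there are ties):

```r
# load the Boston data from the MASS package
library(MASS)
data("Boston")

# Spearman correlation test for one example pair: nox vs. dis
ct <- cor.test(Boston$nox, Boston$dis, method = "spearman", exact = FALSE)

ct$estimate  # Spearman's rho; strongly negative (about -0.9, matching the matrix above)
ct$p.value   # p-value for the null hypothesis of zero correlation
```

For an all-pairs overview with p-values, the ‘corrplot’ package also offers helpers, but pairwise cor.test() is the simplest base R route.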

There seem to be some strong negative correlations (more of A correlates with less of B) between:

  • nitrogen oxide concentration (nox) and the weighted mean of distances to five Boston employment centres (dis).
  • percent of lower status of the population (lstat) and the median value of owner-occupied homes in $1000s (medv).
  • proportion of owner-occupied units built prior to 1940 (age) and the weighted mean of distances to five Boston employment centres (dis).
  • proportion of non-retail business acres per town (indus) and the weighted mean of distances to five Boston employment centres (dis).
  • per capita crime rate by town (crim) and the weighted mean of distances to five Boston employment centres (dis).

There are some strong positive correlations (more of A correlates with more of B) too:

  • nitrogen oxide concentration (nox) and the proportion of non-retail business acres per town (indus).
  • nitrogen oxide concentration (nox) and per capita crime rate by town (crim)
  • nitrogen oxide concentration (nox) and proportion of owner-occupied units built prior to 1940 (age)
  • index of accessibility to radial highways (rad) and full-value property-tax rate per $10,000 (tax).
  • full-value property-tax rate per $10,000 (tax) and per capita crime rate by town (crim)

In summary, some examples of possible correlations:

  • closer mean distance to five Boston employment centres seems to be correlated with higher
    • nitrogen oxide concentration,
    • proportion of owner-occupied units built prior to 1940,
    • proportion of non-retail business acres, and
    • per capita crime rate
  • higher per capita crime rate seems to be correlated with
    • shorter mean distance to five Boston employment centres
    • higher nitrogen oxide concentration
    • higher full-value property-tax rate
  • a higher percent of lower status population seems to be correlated with a lower median value of owner-occupied homes.

4. Data standardization, and division to training and testing sets

We’ll now standardize the data by scaling it. The Boston data has only numerical variables, so we can standardize them all in one go with scale().

Quoted from our course material: in the scaling we subtract the column means from the corresponding columns and divide the difference by the standard deviation.

\[scaled(x) = \frac{x - mean(x)}{ sd(x)}\]
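As a quick sanity check on a toy vector, scale() should match the formula above computed by hand:

```r
x <- c(2, 4, 6, 8)

# manual standardization: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; drop the dimensions for comparison
auto <- as.vector(scale(x))

all.equal(manual, auto)  # TRUE
mean(auto)               # 0 (up to floating point)
```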

# center and standardize variables with scale()
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to a data frame for further analysis
boston_scaled <- as.data.frame(boston_scaled)

After scaling, the mean of every variable is standardized to zero (0). Standardizing variables that were originally on different scales makes their comparison and joint analysis possible. The scale() function returns a matrix, so for further analysis we converted it back to a data frame.

Let’s replace the continuous crime rate variable with a categorical variable with four quantile-based classes: low, medium low, medium high and high crime rate (of the scaled crim).

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

#check new dataset
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865

We’ll now divide the dataset into training (80%) and testing (20%) sets for further model fitting and testing.

# number of rows in the scaled Boston dataset 
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set (random 80% of the scaled Boston data)
train <- boston_scaled[ind,]

# create test set (random 20% of the scaled Boston data)
test <- boston_scaled[-ind,]
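Note that sample() draws randomly, so the split (and everything downstream of it) changes between knits. A small sketch of making the split reproducible with set.seed() (the seed value 2017 is an arbitrary choice; also note that a non-integer size such as 506 * 0.8 = 404.8 is truncated to 404):

```r
set.seed(2017)  # any fixed seed makes the split reproducible

n <- 506  # number of rows in the scaled Boston data
ind <- sample(n, size = n * 0.8)

length(ind)      # 404 rows go to the training set
n - length(ind)  # 102 rows are left for the test set
```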

5. Fitting of linear discriminant analysis (LDA) on the training set

We’ll fit the linear discriminant analysis (LDA), a classification (and dimension reduction) method, on the training set.

  • Target variable: crime (categorical)
  • Predictor variables: zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv
# linear discriminant analysis on the training set
# target variable = crime
# all other variables (.) are the predictor variables
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2425743 0.2500000 0.2549505 0.2524752 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       1.0169598 -0.9229374 -0.15180559 -0.8778800  0.47012031 -0.8863737
## med_low  -0.1463457 -0.2577223 -0.07742312 -0.5747185 -0.07093808 -0.4104997
## med_high -0.3690158  0.1730101  0.18636222  0.4187885  0.06104304  0.4369230
## high     -0.4872402  1.0149946 -0.11793298  1.0187470 -0.37704560  0.8085724
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8868719 -0.6935825 -0.7246173 -0.47341669  0.3867279 -0.79074192
## med_low   0.2866058 -0.5543231 -0.4732602  0.02053927  0.3070947 -0.22882499
## med_high -0.3944342 -0.4589285 -0.3482525 -0.34225826  0.1159547  0.01320748
## high     -0.8427242  1.6596029  1.5294129  0.80577843 -0.7053337  0.91034902
##                 medv
## low       0.52610280
## med_low   0.05478108
## med_high  0.16206130
## high     -0.71617412
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.11321850  0.755645402 -1.042665174
## indus    0.08999244 -0.192398707  0.340064195
## chas    -0.13223567 -0.053073367 -0.006048574
## nox      0.08614333 -0.811847967 -1.290935906
## rm      -0.10740586 -0.007105597 -0.085065226
## age      0.29507475 -0.322876441 -0.372540105
## dis     -0.11804195 -0.289877209  0.007835232
## rad      4.14620182  0.875773433 -0.194856619
## tax      0.03830201 -0.008206037  0.598167786
## ptratio  0.09425565  0.072637571 -0.167518757
## black   -0.10436178  0.062644474  0.107438174
## lstat    0.15892576 -0.211133827  0.316119430
## medv     0.13350937 -0.434985561 -0.085162469
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9610 0.0292 0.0099

There are three discriminant functions (LD1, LD2, LD3). Let’s biplot the results of these functions:

# the function for lda biplot arrows (variables)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results, all dimensions
plot(lda.fit, dimen = 3, col = classes, pch = classes)

# plot LD1 vs. LD2
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

LD1 and LD2 are the strongest functions for separating our observations on the target variable (as we can also see from the proportion of trace in the summary of the LDA model output and from the graphical overview of all the dimensions LD1 through LD3).

The graphical overview of our LDA model shows that LD1 and LD2 clearly separate the groups of observations: the high per capita crime rate class (from our target variable, crime) is separated from the other, lower quartiles. The crime rate seems to be higher when the index of accessibility to radial highways (rad) is higher. The other variables don’t stand out as much.

6. Predicting with the LDA model on the testing set

To predict crime rate categories with our LDA model, we first save the true crime categories from the test set and remove them from the test data. We then predict the crime rate classes for the test observations and compare the predictions to the saved true classes.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable (true/correct crime rate classes) from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results and add margin sums with addmargins()
tbres1 <- table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
tbres1
##           predicted
## correct    low med_low med_high high Sum
##   low       15      13        1    0  29
##   med_low    3      16        6    0  25
##   med_high   0       4       16    3  23
##   high       0       0        1   24  25
##   Sum       18      33       24   27 102
#number of rows in test set
nrow(test)
## [1] 102

Our model seems to predict crime rates quite nicely:

  • 51.7 % of low crime rate predicted correctly.
  • 64 % of medium low crime rate predicted correctly.
  • 69.6 % of medium high crime rate predicted correctly.
  • 96 % of high crime rate predicted correctly.
  • In total, 69.6 % of all 102 crime rate observations predicted correctly.
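The per-class percentages above were computed by hand from the cross tabulation. The overall accuracy of any confusion matrix can also be read off programmatically with diag(), since correct predictions sit on the diagonal. A small sketch on a toy 2x2 table (not the table above):

```r
# toy confusion matrix: rows = correct classes, columns = predictions
conf <- matrix(c(15,  3,
                  5, 27), nrow = 2, byrow = TRUE,
               dimnames = list(correct = c("a", "b"),
                               predicted = c("a", "b")))

# correct predictions lie on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # (15 + 27) / 50 = 0.84
```

The same idiom works on the `table()` output above (before addmargins() is applied).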

7. Model optimization with k-means algorithm

We want to investigate the similarity between our observations, so we’ll group similar data points into clusters by running the k-means algorithm on the dataset.

We’ll first standardize the data for further analysis.

# Reload the 'Boston' data
data("Boston")

# Recenter and restandardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to a data frame for further analysis
boston_scaled <- as.data.frame(boston_scaled)

As instructed in our assignment, we’ll calculate the distances between observations to assess the similarity between our data points. We’ll calculate both Euclidean and Manhattan distance matrices. It is worth noting that the k-means clustering algorithm itself uses Euclidean distances.
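A quick sketch of what dist() computes, using two points where the answer is known in advance (the 3-4-5 right triangle):

```r
pts <- rbind(a = c(0, 0), b = c(3, 4))

# Euclidean distance: sqrt(3^2 + 4^2) = 5
d_eu <- dist(pts)  # default method = "euclidean"

# Manhattan distance: |3 - 0| + |4 - 0| = 7
d_man <- dist(pts, method = "manhattan")

c(euclidean = as.numeric(d_eu), manhattan = as.numeric(d_man))  # 5, 7
```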

# Calculate distances between the observations.
# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
## Run k-means algorithm on the dataset
## k-means clustering
# K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)
km <- kmeans(boston_scaled, centers = 4)

# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

With k-means clustering with 4 clusters there seems to be some overlap between clusters (based on the plots). Let’s investigate the optimal number of clusters for our model by looking at the total within sum of squares (WCSS) and how it changes as the number of clusters changes.

# Investigate what is the optimal number of clusters and run the algorithm again
# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line', ylab='Total within sum of squares (WCSS)', xlab='number of clusters')

The optimal number of clusters is where the total WCSS drops radically. Here, the optimal number of clusters for our model seems to be 2. Let’s run the algorithm again with 2 clusters and visualize the results.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)


# Visualize the clusters
# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[1:5], col = km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

pairs(boston_scaled[11:14], col = km$cluster)

We determined that grouping our data with k-means clustering is best done with two cluster centroids (-> yields two clusters). However, we do not know what these clusters actually represent.

From the plots we can see that there are very clear separate clusters at least within rad and tax.

To determine possible reasons for the clustering result we could draw a parallel coordinates plot (see the instructions linked in the course material).
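A minimal sketch of such a parallel coordinates plot with MASS::parcoord(): one line per observation, one axis per variable, with line color showing group membership. Shown here on the built-in iris data so the snippet is self-contained; on our data the analogous call would be parcoord(boston_scaled, col = km$cluster).

```r
library(MASS)

# parallel coordinates plot of the four numeric iris variables,
# one colored line per observation, grouped by species
parcoord(iris[, 1:4], col = as.numeric(iris$Species))
```

Variables where the colored line bundles separate clearly are good candidates for what drives the clustering.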

Bonus: K-means and biplot visualization

# Reload the 'Boston' data
data("Boston")

# Recenter and restandardize variables
boston_scaled_bonus <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled_bonus)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled_bonus)
## [1] "matrix" "array"
# change the object to a data frame for further analysis
boston_scaled_bonus <- as.data.frame(boston_scaled_bonus)

Let’s run the k-means algorithm with an arbitrarily chosen four clusters.

## Run k-means algorithm with 4 cluster centers on the dataset

## k-means clustering
# K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that (already used in a previous chunk of code in this document)
km_b <- kmeans(boston_scaled_bonus, centers = 4)

# plot the scaled Boston dataset with clusters
#pairs(Boston, col = km$cluster)
#pairs(Boston[6:10], col = km$cluster)

#### Perform LDA using the clusters as target classes

# assess the kmeans() result summary
summary(km_b) #it is not a dataframe
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       56    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
typeof(km_b) #it is a 'list' object
## [1] "list"
str(km_b)
## List of 9
##  $ cluster     : Named int [1:506] 1 1 1 2 2 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ centers     : num [1:4, 1:14] -0.377 -0.408 0.859 -0.199 -0.333 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "1" "2" "3" "4"
##   .. ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
##  $ totss       : num 7070
##  $ withinss    : num [1:4] 1102 623 1380 341
##  $ tot.withinss: num 3446
##  $ betweenss   : num 3624
##  $ size        : int [1:4] 222 98 152 34
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
# --> km_b is a 'list' of 9 components; 'cluster' assigns a cluster to each of the 506 observations.
# More detailed info:
#?kmeans()

# subset 'cluster' from km (list) and add it as a column to our 'boston_scaled' dataframe
km_b_clust <- km_b$cluster
km_b_clust <- as.data.frame(km_b_clust)
boston_scaled_bonus <- mutate(boston_scaled_bonus, km_b_clust)

# Run linear discriminant analysis with the clusters as target classes and all other variables of the dataset as explanatory variables
lda.fit_bonus <- lda(km_b_clust ~ ., data = boston_scaled_bonus)

# print the lda.fit object
lda.fit_bonus
## Call:
## lda(km_b_clust ~ ., data = boston_scaled_bonus)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.43873518 0.19367589 0.30039526 0.06719368 
## 
## Group means:
##         crim         zn      indus       chas        nox          rm
## 1 -0.3773550 -0.3325348 -0.3302399 -0.2723291 -0.3380896 -0.07661724
## 2 -0.4083186  1.5993013 -1.0868786 -0.2321546 -1.0252997  0.85762197
## 3  0.8588075 -0.4872402  1.1204442 -0.2723291  1.0691487 -0.50270159
## 4 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.27566811
##           age         dis          rad        tax     ptratio      black
## 1 -0.06014166  0.07356985 -0.588184980 -0.6041027 -0.06143853  0.2840492
## 2 -1.22890399  1.28526187 -0.598658222 -0.6419128 -0.74348995  0.3532213
## 3  0.79691805 -0.84588600  1.244794751  1.3179961  0.65687472 -0.6809675
## 4  0.37213224 -0.40333817  0.001081444 -0.0975633 -0.39245849  0.1715427
##        lstat        medv
## 1 -0.2086903  0.03252933
## 2 -0.9364928  0.91875080
## 3  0.9453522 -0.76810974
## 4 -0.1643525  0.57334091
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## crim    -0.034836228  0.01232052  0.13192586
## zn      -0.267361320  0.15788045  1.20991564
## indus    0.020533059 -0.58778276  0.21069583
## chas     5.828653032  0.14949639  0.09257929
## nox      0.026703754 -0.34585368  0.54345497
## rm      -0.068843435  0.07156541  0.37000239
## age      0.098415418 -0.04458773 -0.53934235
## dis      0.053100700  0.23565203  0.47758361
## rad      0.064174139 -0.72837621  0.11551231
## tax      0.033261463 -0.60257854  0.70761129
## ptratio -0.007773957 -0.10230899  0.07726420
## black    0.015155025  0.03258028 -0.02842152
## lstat   -0.133125916 -0.26354444  0.52415255
## medv    -0.146715844  0.14530719  0.54132868
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8121 0.1472 0.0407
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes / clusters as numeric
classes_b <- as.numeric(boston_scaled_bonus$km_b_clust)

# plot the lda results
plot(lda.fit_bonus, col = classes_b, pch = classes_b)

plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 1)

plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 9)

When clustering the Boston data to 4 clusters with the k-means algorithm the most influential linear separators seem to be:

  1. the Charles River dummy variable (chas)
  2. the index of accessibility to radial highways (rad)
  3. full-value property-tax rate (tax)

The results are somewhat different from our results with the 2-cluster algorithm. Looking at the full matrix of biplots (LD1, LD2, LD3), we could suspect there being either two or five different clusters. From our earlier analyses we know, though, that two clusters seem to be the best clustering option.

OBS! The results changed after re-running the R code (k-means involves random initialization). To answer our assignment in general terms, the three most influential variables are the ones with the longest arrows on the LD1 vs LD2 plot. For easier assessment of the arrows, the second LD1 vs LD2 plot uses scaled arrows (see the myscale argument).

Super-Bonus: 3D-plotting of classifications and clusters

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

We drew a very cool dynamic 3D plot following the instructions of our course assignment! Let’s customize it:

# add colors
# set the color to be the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes) %>% layout(title = 'Plot A')

To set the colors to be the clusters of the kmeans, we’ll join data with left_join() of dplyr:

# set the color to be the clusters of the kmeans (2 clusters)
# subset 'cluster' from km (list) and add it as a column to our 'boston_scaled' dataframe
km_sb_clust <- km$cluster
km_sb_clust <- as.data.frame(km_sb_clust)
boston_scaled_sb <- mutate(boston_scaled, km_sb_clust)
# merge data
train_sb = train %>% left_join(boston_scaled_sb)
# check data
head(train_sb);dim(train_sb)
##            zn      indus       chas        nox         rm        age        dis
## 1 -0.48724019  1.0149946 -0.2723291 -0.1958536 -0.0606794 -0.1376575 -0.1761129
## 2 -0.48724019  1.5674443 -0.2723291  0.5980871  0.2467426  1.0773117 -0.7961887
## 3  0.37030254 -1.0446662 -0.2723291  0.7965722  0.7932707  1.1163897 -0.8473829
## 4  0.04872402 -0.7385595 -0.2723291 -1.2573178 -0.9829455 -1.1288166  1.2836322
## 5  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.5630867 -1.0506607  0.7863653
## 6 -0.48724019  1.0149946 -0.2723291  1.2539511 -1.9991462  1.1163897 -1.0474104
##          rad        tax    ptratio     black        lstat        medv    crime
## 1  1.6596029  1.5294129  0.8057784 0.4406159 -0.267896200  0.05079791     high
## 2 -0.6373311  0.1706618  1.2676838 0.4202424 -0.007430722 -0.36237562 med_high
## 3 -0.5224844 -0.8558183 -2.5199404 0.3861769 -0.805631380  0.82278004 med_high
## 4 -0.6373311 -0.3752120  0.2053014 0.4406159  0.061186528 -0.55808940  med_low
## 5 -0.5224844 -0.5769480 -1.5037485 0.3705134  0.428078760 -0.09055093  med_low
## 6  1.6596029  1.5294129  0.8057784 0.1779505  2.516003640 -1.34094452     high
##         crim km_sb_clust
## 1  0.2569871           1
## 2 -0.3805669           1
## 3 -0.3437607           2
## 4 -0.4043440           2
## 5 -0.4091990           2
## 6  1.2463082           1
## [1] 404  16
# target classes / clusters as numeric
classes_sb <- as.numeric(train_sb$km_sb_clust)

# Draw the 3D plot
model_predictors_sb <- dplyr::select(train_sb, -crime, -crim, -km_sb_clust)
# check the dimensions
dim(model_predictors_sb)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit$scaling
matrix_product_sb <- as.data.frame(matrix_product_sb)

#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = classes_sb)  %>% layout(title = 'Plot B')

Plot A shows four different colors depicting our four crime rate quartiles: ‘low = 1’, ‘medium low = 2’, ‘medium high = 3’, ‘high = 4’. Plot B shows coloring by our optimal two clusters. Comparing plots A and B, we notice that the dark blue cluster (Plot B) seems to include some observations of the ‘medium low’ crime rate class, while the yellow cluster of Plot B seems to include observations from the low, medium low and medium high classes. Some medium high crime rate observations seem to have been clustered into the dark blue cluster of Plot B. In summary, our model seems to separate the high crime rate suburbs into their own cluster quite nicely.

Ch5 Dimensionality reduction techniques

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Mon Dec 12 22:37:20 2022"

Here we go again…

Libraries used for Ch5

Note! You might need to install some packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot", "psych"))

Load the needed R packages before getting to work:

#load required packages
library(tidyverse)
library(GGally)
#library(dplyr)
library(ggplot2)
library(corrplot)
#library(stringr)
library(psych) 
library(FactoMineR)

Data analysis (15p.)

1. Loading and describing the data ‘human’

# load the 'human' data
human <- read.csv("human_row_names.csv", row.names = 1)
# Alternative data url
# human <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", row.names = 1)

# explore the dataset
str(human);dim(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## [1] 155   8

Graphical overview of the data

# summary of the data
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# Summary plot matrix with correlation and density plots
p1 <- ggpairs(human, mapping = aes(alpha=0.3), 
              title="Descriptives and Pearson correlation",
              lower = list(combo = wrap("facethist", bins = 20)))
p1

# compute the correlation matrix and visualize it with corrplot

#correaltion matrix
cor_matrix_h <- cor(human, method='spearman')

#p-values that correspond our correlation coefficients
cor_pmatrix_h <- corr.test(human, method = "spearman")$p

# visualize
corrplot(cor_matrix_h, type='upper', tl.col='black', col=COL2('BrBG'), addCoef.col='black', number.cex = 0.6, p.mat = cor_pmatrix_h, sig.level = 0.05)

All our variables are skewed. The significance markings (p-values) seem to differ a little between the two plots. In our first plot, significance levels are denoted as

  • “***” p < 0.001
  • “**” p < 0.01
  • “*” p < 0.05
  • “.” p < 0.10
  • ” ” otherwise

Our second plot shows all Spearman’s correlation coefficients both as numbers and as the color and size of the circles; correlations that are not statistically significant (p ≥ 0.05) are crossed out.

This difference arises because ggpairs() (1st plot) uses Pearson’s correlation, while we supplied corrplot() (2nd plot) with Spearman’s correlations. Since our variables seem skewed, I prefer Spearman’s correlation for these bivariate correlation analyses.
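The difference between the two coefficients is easy to demonstrate: Spearman’s correlation depends only on ranks, so it is unaffected by a monotone transformation, while Pearson’s is not. A toy sketch with hypothetical data:

```r
set.seed(1)
x <- rexp(200)   # skewed variable
y <- x^3         # monotone but non-linear transformation of x

cor(x, y)                        # Pearson: less than 1
cor(x, y, method = "spearman")   # Spearman: exactly 1, since ranks are preserved
```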

There seem to be several strong and statistically significant correlations:

  • Higher maternal mortality (Mat.Mor) is correlated with:
    • lower ratio of females to males with at least secondary education (Edu2.FM)
    • lower expected years of schooling (Edu.Exp)
    • lower life expectancy (Life.Exp)
    • lower gross national income per capita (GNI)
    • higher adolescent birth rate (Ado.Birth)
  • Higher adolescent birth rate is correlated with:
    • lower ratio of females to males with at least secondary education
    • lower expected years of schooling
    • lower life expectancy
    • lower GNI
  • Higher GNI seems to be associated with:
    • higher ratio of females to males with at least secondary education
    • higher expected years of schooling
    • higher life expectancy
  • Neither the ratio of females to males in the labour force (Labo.FM) nor the percentage of female representatives in parliament (Parli.F) seems to be statistically significantly associated with any of the other variables.

2. Principal component analysis with raw data

I have had a hard time understanding PCA, but our course material had a very clear intro to it (quoted directly): *“Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).

There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method.

Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.”*
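The SVD connection can be checked directly: the principal component scores returned by prcomp() equal U·D from the SVD of the column-centered data matrix. A small sketch on hypothetical random data:

```r
set.seed(42)
X  <- matrix(rnorm(50 * 4), ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)  # center columns, as prcomp() does

sv  <- svd(Xc)
pca <- prcomp(X)

# scores agree up to column sign flips: Xc %*% V == U %*% D
all.equal(abs(unname(pca$x)), abs(sv$u %*% diag(sv$d)))  # TRUE
```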

We’ll first run a PCA on raw, non-standardized data (part 2) and then on standardized data (part 3). Finally, we’ll interpret the results on part 4 of this assignment.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# create and print out a summary of pca_human (= variability captured by the principal components)
s <- summary(pca_human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(1*s$importance[2, ], digits = 1)*100
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

All the data is clustered in the upper right corner of the plot. PC1 explains essentially 100% of the total variance of the dataset. Because the GNI arrow runs parallel to the PC1 axis, we know that GNI contributes almost exclusively to PC1. This means that knowing an observation’s position along PC1 would already give an accurate view of how it relates to the other observations in our sample, and basically only the GNI value would be needed to determine that. Since the data is not scaled, the dominance of GNI in PC1 is most probably due to GNI having by far the largest variance of the variables in our dataset!
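The variance argument is easy to verify on the raw data (assuming `human` is loaded as above): GNI’s variance is orders of magnitude larger than that of every other variable, so it dominates the unscaled PCA.

```r
# variances of the unscaled variables, largest first
sort(sapply(human, var), decreasing = TRUE)
```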

3. Principal component analysis with standardized data

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
# create and print out a summary of pca_human (= variability captured by the principal components)
s_hs <- summary(pca_human_std)
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percentages of variance captured by each PC
pca_pr_hs <- round(1*s_hs$importance[2, ], digits = 1)*100
pca_pr_hs
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
##  50  20  10  10  10   0   0   0
# create object pc_lab to be used as axis labels
pc_lab_hs <- paste0(names(pca_pr_hs), " (", pca_pr_hs, "%)")

# draw a biplot
biplot(pca_human_std, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_hs[1], ylab = pc_lab_hs[2])

The plot with standardized data looks very different from the earlier biplot made with unstandardized data. Also, the variability captured by the principal components is distributed more realistically among the PCs. The results differ between unstandardized and standardized data because PCA is sensitive to variable scales, so the data must be standardized for the variables to be comparable.

For the PCA with standardized data:

  • PC1 explains 54 % of variance in the dataset.
  • PC2 explains 16 % of variance in the dataset.
  • PC3 explains 10 % of variance in the dataset.
  • PC4 explains 8 % of variance in the dataset.
  • PC5 explains 5 % of variance in the dataset.
  • PC6 explains 4 % of variance in the dataset.
  • PC7 explains 3 % of variance in the dataset.
  • PC8 explains 1 % of variance in the dataset.

In summary, the data should be scaled before PCA for the analysis to be trustworthy.
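For reference, scale() standardizes each column to mean 0 and standard deviation 1; the manual equivalent can be sketched like this (assuming `human` is loaded as above):

```r
# subtract each column's mean, then divide by its standard deviation
human_manual <- sweep(human, 2, colMeans(human), "-")
human_manual <- sweep(human_manual, 2, apply(human, 2, sd), "/")

# matches scale() up to attributes
all.equal(as.matrix(human_manual), scale(human), check.attributes = FALSE)  # TRUE
```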

4. Interpretations

I’ll elaborate on the results of the PCA with standardized data:

PC1 explains the most (53.6%) of the variability in the dataset, and the variables that most affect it are:

  • Expected years of schooling (positive effect)
  • Life expectancy at birth (positive effect)
  • GNI (positive effect)
  • Ratio of females to males with at least secondary education (positive effect)
  • Maternal mortality (negative effect)
  • Adolescent birth rate (negative effect)

The four variables with a positive effect on PC1 are positively correlated with each other, and the two with a negative effect are positively correlated with each other and negatively with the first four.

PC2 explains the second most (16.2%) of the variability in the dataset, and the variables that most affect it are:

  • Ratio of females to males in the labour force (positive effect)
  • Percentage of female representatives in parliament (positive effect)

These two variables are positively correlated with each other and essentially uncorrelated with the variables affecting PC1 (their arrows are roughly perpendicular to the others).
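These readings of the biplot arrows can be checked numerically: with standardized data, the correlations between the original variables and the PC scores mirror the arrow directions (a sketch, assuming the objects created above):

```r
# correlation of each standardized variable with the first two principal components
round(cor(human_std, pca_human_std$x[, 1:2]), 2)
```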

5. Analysis of tea dataset

According to our assignment, the tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and about their perception of the product (12 questions). In addition, some personal details were recorded (4 questions).

We’ll first load the data (wrangled by our course professor K.V.) and then start exploring and analyzing it!

# load the data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Structure and dimensions of the data
str(tea);dim(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300  36
# View data
View(tea)

Let’s visualize the data:

# summary of the data
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
# Summary plot matrix grouped by sex and age quartile

tea_long <- tea %>% 
  pivot_longer(cols=-c(sex,age_Q, age), names_to = "variable",values_to = 'value') %>%
  group_by(sex,age_Q,variable,value) %>%
  summarise(n=n())
tea_long
## # A tibble: 764 × 5
## # Groups:   sex, age_Q, variable [330]
##    sex   age_Q variable         value                   n
##    <fct> <fct> <chr>            <fct>               <int>
##  1 F     +60   always           always                  2
##  2 F     +60   always           Not.always             21
##  3 F     +60   breakfast        breakfast               8
##  4 F     +60   breakfast        Not.breakfast          15
##  5 F     +60   dinner           dinner                  1
##  6 F     +60   dinner           Not.dinner             22
##  7 F     +60   diuretic         diuretic               14
##  8 F     +60   diuretic         Not.diuretic            9
##  9 F     +60   effect.on.health effect on health        6
## 10 F     +60   effect.on.health No.effect on health    17
## # … with 754 more rows
# Barplots of all variables

# visualize the dataset
pivot_longer(tea, cols = -c(age)) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Unfortunately there are so many variables that the plots become cramped and hard to read. We’ll have to choose some of the most interesting variables to plot:

# column names to keep in the dataset
keep_columns <- c("age_Q", "sex", "Tea", "How", "price", "SPC", "Sport", "frequency", "relaxing", "healthy", "sophisticated")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##    age_Q    sex            Tea         How                  price    
##  +60  :38   F:178   black    : 74   alone:195   p_branded      : 95  
##  15-24:92   M:122   Earl Grey:193   lemon: 33   p_cheap        :  7  
##  25-34:69           green    : 33   milk : 63   p_private label: 21  
##  35-44:40                           other:  9   p_unknown      : 12  
##  45-59:61                                       p_upscale      : 53  
##                                                 p_variable     :112  
##                                                                      
##            SPC               Sport           frequency          relaxing  
##  employee    :59   Not.sportsman:121   +2/day     :127   No.relaxing:113  
##  middle      :40   sportsman    :179   1 to 2/week: 44   relaxing   :187  
##  non-worker  :64                       1/day      : 95                    
##  other worker:20                       3 to 6/week: 34                    
##  senior      :35                                                          
##  student     :70                                                          
##  workman     :12                                                          
##         healthy              sophisticated
##  healthy    :210   Not.sophisticated: 85  
##  Not.healthy: 90   sophisticated    :215  
##                                           
##                                           
##                                           
##                                           
## 
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Again, quoting a very insightful part of our course material: Multiple Correspondence Analysis (MCA) is a method to analyze qualitative data and it is an extension of Correspondence Analysis (CA). MCA can be used to detect patterns or structure in the data as well as for dimension reduction.

Since there are many variables, we’ll focus only on the ones we previously chose (“age_Q”, “sex”, “Tea”, “How”, “price”, “SPC”, “Sport”, “frequency”, “relaxing”, “healthy”, “sophisticated”) and run an MCA.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = TRUE)
## Warning: ggrepel: 21 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# summary of the model
mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 11 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.205   0.179   0.138   0.129   0.126   0.125   0.114
## % of var.              8.059   7.023   5.404   5.073   4.946   4.910   4.486
## Cumulative % of var.   8.059  15.082  20.485  25.558  30.504  35.414  39.899
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.108   0.104   0.102   0.099   0.096   0.090   0.089
## % of var.              4.251   4.068   3.994   3.889   3.790   3.521   3.508
## Cumulative % of var.  44.150  48.218  52.212  56.101  59.891  63.412  66.920
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.085   0.082   0.076   0.074   0.071   0.068   0.065
## % of var.              3.353   3.209   3.003   2.896   2.809   2.667   2.536
## Cumulative % of var.  70.273  73.482  76.485  79.382  82.190  84.857  87.393
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.061   0.056   0.055   0.053   0.044   0.028   0.024
## % of var.              2.406   2.208   2.160   2.071   1.720   1.091   0.952
## Cumulative % of var.  89.798  92.007  94.166  96.237  97.957  99.048 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## 1         |  0.205  0.068  0.009 |  0.966  1.740  0.207 |  0.345  0.289  0.027
## 2         |  0.298  0.145  0.036 |  0.576  0.618  0.135 |  0.099  0.024  0.004
## 3         | -0.118  0.023  0.006 |  0.135  0.034  0.007 |  0.034  0.003  0.000
## 4         | -0.497  0.401  0.183 | -0.191  0.068  0.027 | -0.017  0.001  0.000
## 5         | -0.237  0.091  0.031 |  0.274  0.140  0.042 |  0.078  0.015  0.003
## 6         | -0.800  1.040  0.253 |  0.002  0.000  0.000 |  0.364  0.322  0.053
## 7         | -0.020  0.001  0.000 |  0.542  0.549  0.107 |  0.441  0.472  0.070
## 8         |  0.197  0.063  0.014 |  0.231  0.099  0.019 |  0.337  0.275  0.041
## 9         |  0.195  0.062  0.014 |  0.569  0.603  0.119 |  0.617  0.922  0.140
## 10        |  0.496  0.400  0.090 |  0.323  0.195  0.038 |  0.553  0.742  0.112
##            
## 1         |
## 2         |
## 3         |
## 4         |
## 5         |
## 6         |
## 7         |
## 8         |
## 9         |
## 10        |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
## +60       |   1.564  13.735   0.355  10.301 |  -1.327  11.337   0.255  -8.737 |
## 15-24     |  -1.100  16.446   0.535 -12.650 |  -0.578   5.203   0.148  -6.643 |
## 25-34     |  -0.125   0.159   0.005  -1.182 |   0.668   5.214   0.133   6.310 |
## 35-44     |   0.543   1.741   0.045   3.682 |   0.521   1.842   0.042   3.535 |
## 45-59     |   0.470   1.992   0.056   4.107 |   0.601   3.730   0.092   5.247 |
## F         |  -0.088   0.203   0.011  -1.836 |  -0.402   4.872   0.236  -8.393 |
## M         |   0.128   0.297   0.011   1.836 |   0.586   7.108   0.236   8.393 |
## black     |   0.708   5.484   0.164   7.008 |  -0.092   0.107   0.003  -0.913 |
## Earl Grey |  -0.375   4.017   0.254  -8.716 |   0.000   0.000   0.000  -0.010 |
## green     |   0.607   1.795   0.046   3.689 |   0.209   0.245   0.005   1.272 |
##             Dim.3     ctr    cos2  v.test  
## +60         0.600   3.011   0.052   3.950 |
## 15-24       0.209   0.882   0.019   2.399 |
## 25-34      -0.358   1.945   0.038  -3.380 |
## 35-44      -0.020   0.003   0.000  -0.134 |
## 45-59      -0.271   0.984   0.019  -2.364 |
## F          -0.247   2.393   0.089  -5.159 |
## M           0.360   3.491   0.089   5.159 |
## black       0.673   7.394   0.149   6.664 |
## Earl Grey  -0.130   0.719   0.030  -3.020 |
## green      -0.750   4.086   0.069  -4.557 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## age_Q     | 0.769 0.537 0.103 |
## sex       | 0.011 0.236 0.089 |
## Tea       | 0.255 0.007 0.185 |
## How       | 0.137 0.083 0.061 |
## price     | 0.106 0.115 0.351 |
## SPC       | 0.697 0.618 0.328 |
## Sport     | 0.140 0.061 0.104 |
## frequency | 0.019 0.135 0.104 |
## relaxing  | 0.069 0.120 0.034 |
## healthy   | 0.033 0.030 0.081 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali") #individuals (ind) invisible = plots variables

plot(mca, invisible=c("var"), graph.type = "classic") #variables (var) invisible = plots individuals

Our first and fourth plots (the MCA factor maps) show variable categories with similar profiles grouped together:

  • ‘non-workers’, ‘+60’ year olds and drinking tea in an ‘other’ way (i.e. not alone, not with lemon or milk) have similar profiles. Also, ‘student’ and ‘15-24’ year olds seem to have similar profiles.

  • ‘non-workers’, ‘+60’ year olds and ‘other’ profiles seem to be negatively correlated with the ‘student’ and ‘15-24’ year-old categories (they lie on opposite sides of the plot origin), and especially with the ‘25-34’ year-old and ‘workman’ categories.

  • ‘non-workers’, ‘+60’ year olds, ‘other’-style drinkers, ‘student’, ‘15-24’ year olds, ‘senior’, and ‘p_unknown’ (tea price unknown) seem to be the variable categories that deviate most from the average profile (they are the farthest from the origin).

From the third plot (correlation between variables and principal dimensions) we can see that

  • SPC (socio-professional category) and age are correlated with each other
  • of all the variables, SPC and age are the most strongly correlated with the first dimensions (Dim 1 and 2), which explain the most (about 15%) of the variance of the chosen data.

To assess whether the variable categories differ significantly from each other we’ll draw confidence ellipses, which is possible with ‘FactoMineR’:

plotellipses(mca)

The plots are hard to read, so we’ll focus on specific variables 4 at a time.

plotellipses(mca, keepvar = c("sex", "Tea", "age_Q", "How"))

plotellipses(mca, keepvar = c("Sport", "price", "frequency", "SPC"))

plotellipses(mca, keepvar = c("relaxing", "healthy", "sophisticated"))

This shows us:

  • the categories of ‘sex’ and of ‘How’ each seem to differ from one another.
  • For the ‘age_Q’ variable, ‘15-24’ and ‘+60’ categories seem different from the other categories.
  • ‘Earl Grey’ seems to differ from the other tea varieties
  • Drinking tea the ‘other’ way seems to differ from drinking it alone, with lemon, or with milk.
  • ‘student’ and ‘non-worker’ differ from each other and from the other SPC categories.
  • The ‘Sport’ categories (doing sports or not) seem to differ from each other.
  • Frequency categories don’t seem to differ significantly from each other.
  • Tea with unknown price ‘p_unknown’ seems to differ from the other price categories.
  • All three product perception variables ‘healthy’, ‘relaxing’ and ‘sophisticated’ seem to have differing categories.

✓ ✓ ✓ ✓ ✓ Ch5 done!

Ch6 Analysis of longitudinal data

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Mon Dec 12 22:37:41 2022"

Here we go again!

Sweet, sweet longitudinal data…

For longitudinal data we cannot assume observations to be independent of each other, since longitudinal data means multiple observations of the same individuals. For this assignment we’ll first do some data wrangling to transform our data from wide form to long form. Then we’ll analyse our data with linear mixed effects statistical models: a random intercept model and a random intercept and slope model.
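The wide-to-long step lives in meet_and_repeat.R (see below); as a minimal sketch of the idea, here is a pivot_longer() call on made-up data shaped like the rats dataset (the WD<day> column names are modeled on that data and are illustrative):

```r
library(tidyverse)

# Illustrative wide data shaped like 'rats': one row per rat,
# one WD<day> column per measurement day (made-up values)
rats_wide <- tibble(
  ID = 1:2, Group = c(1, 2),
  WD1 = c(240, 410), WD8 = c(250, 415)
)

# Wide -> long: one row per rat per measurement day
rats_long <- rats_wide %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(str_extract(WD, "\\d+")))

nrow(rats_long)  # 2 rats x 2 days = 4 rows
```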

Libraries used for Ch6

Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse", "lme4", "car"))

Load the needed R packages before getting to work:

#load required packages
library(tidyverse)
#library(GGally)
#library(dplyr)
library(ggplot2)
#library(corrplot)
#library(stringr)
#library(psych) 
#library(FactoMineR)
library(lme4)

Data wrangling (5p.)

See meet_and_repeat.R on my repository for the data wrangling part.

Data Analysis (15p.)

1. Analysis from ch 8 of MABS - Analysis of Longitudinal Data I: Graphical Displays and Summary Measure Approach

For the first part of the analysis, graphical displays and summary measure approach to longitudinal data we’ll use the wrangled ‘rats’ dataset now found in my repository.

# read long format data of rats
ratsL <- read.csv('rats_long.csv', row.names = 1)

# glimpse and dimensions
head(ratsL);dim(ratsL)
##   ID Group  WD Weight Time
## 1  1     1 WD1    240    1
## 2  2     1 WD1    225    1
## 3  3     1 WD1    245    1
## 4  4     1 WD1    260    1
## 5  5     1 WD1    255    1
## 6  6     1 WD1    260    1
## [1] 176   5
# structure and summary of the data
str(ratsL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
summary(ratsL)
##        ID            Group           WD                Weight     
##  Min.   : 1.00   Min.   :1.00   Length:176         Min.   :225.0  
##  1st Qu.: 4.75   1st Qu.:1.00   Class :character   1st Qu.:267.0  
##  Median : 8.50   Median :1.50   Mode  :character   Median :344.5  
##  Mean   : 8.50   Mean   :1.75                      Mean   :384.5  
##  3rd Qu.:12.25   3rd Qu.:2.25                      3rd Qu.:511.2  
##  Max.   :16.00   Max.   :3.00                      Max.   :628.0  
##       Time      
##  Min.   : 1.00  
##  1st Qu.:15.00  
##  Median :36.00  
##  Mean   :33.55  
##  3rd Qu.:50.00  
##  Max.   :64.00
# Convert categorical data to factors
## ID and Group
ratsL$ID <- factor(ratsL$ID)
ratsL$Group <- factor(ratsL$Group)

We have a dataset with 16 rats and 11 weight (grams) observations per rat, measured on different days (1–64). The rats are divided into 3 groups, each on a different diet. Our data is in long format, which lets us study possible weight differences between the three diet groups and the possible change of weight over time.

Let’s plot the Weights of each rat by time and groups.

# Draw the plot
ggplot(ratsL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(ratsL$Weight), max(ratsL$Weight)))

The weight of each rat increases through the study period. Group 1 has the most rats (8), while groups 2 and 3 have 4 rats each. Both the baseline and final weights of the rats increase by group, with group 1 having the lightest rats and group 3 the heaviest. One rat in group 2 has the highest baseline and final weight.

Standardizing to assess the tracking phenomenon

Higher baseline weight may predict higher final weight. This is called the tracking phenomenon. The plots drawn earlier suggest this might be the case in our data. We can assess it more clearly by plotting standardized data.

We’ll standardize our data with scale():

\[standardised(x) = \frac{x - mean(x)}{ sd(x)}\]

# Standardise the variable Weight at each time point
ratsL <- ratsL %>%
  group_by(Time) %>%
  mutate(Weight_std= (scale(Weight))) %>%
  ungroup()

# Glimpse the data
head(ratsL)
## # A tibble: 6 × 6
##   ID    Group WD    Weight  Time Weight_std[,1]
##   <fct> <fct> <chr>  <int> <int>          <dbl>
## 1 1     1     WD1      240     1         -1.00 
## 2 2     1     WD1      225     1         -1.12 
## 3 3     1     WD1      245     1         -0.961
## 4 4     1     WD1      260     1         -0.842
## 5 5     1     WD1      255     1         -0.882
## 6 6     1     WD1      260     1         -0.842
# Plot again with the standardised weight
ggplot(ratsL, aes(x = Time, y = Weight_std, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(name = "standardized Weight")

After standardizing, each weight is expressed relative to the mean weight at that time point, so the overall growth trend is removed and the differences between the baseline and final weights look smaller in the plot.

Summary graphs

As our course material indicates, we can produce graphs showing average (mean) profiles for each group along with some indication of the variation of the observations at each time point (the standard error of mean):

\[se = \frac{sd(x)}{\sqrt{n}}\]

# Summary data with mean and standard error of Weight by Group and Time
ratsLs <- ratsL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(length(Weight))) ) %>%
  ungroup()

# Glimpse the data
glimpse(ratsLs)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87…
# Plot the mean profiles
ggplot(ratsLs, aes(x = Time, y = mean, color=Group, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  # legend inside plot
  #theme(legend.position = c(0.8,0.4)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

There is no significant overlap between the groups, although the difference between groups 2 and 3 is clearly smaller than the difference between group 1 and either of them. There is also the least variation in weights within group 3 and the most within group 2. The weight of the rats increases in each group during the study.

Summary measure approach

Let’s look at the weight measurements after the start of the study (i.e. we’ll exclude the baseline, day 1, weights). We’ll calculate the mean weight for each rat, and then draw boxplots of the mean weight for each diet group.

# Create summary data by group and subject, with mean as the summary variable (ignoring baseline day 1)
ratsL_sum <- ratsL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(Weight_mean=mean(Weight)) %>%
  ungroup()

# Glimpse the data
glimpse(ratsL_sum)
## Rows: 16
## Columns: 3
## $ Group       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID          <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Weight_mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 44…
# Draw a boxplot of the mean versus diet
ggplot(ratsL_sum, aes(x = Group, y = Weight_mean, col=Group)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 8-64")

The mean summary measure is more variable in the second treatment group, and its distribution in this group is somewhat skewed. Each of the three groups seems to have an outlier: a mean weight more than 1.5 times the interquartile range above the upper quartile or below the lower quartile. These outliers might bias the conclusions from the later comparisons of the groups, so we’ll remove them from our analysis and redraw the boxplots. The outlier in group 3 is within the overall filtering limits (the min and max of the group means), so we’ll look up its value separately to exclude it from the group.
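The 1.5 × IQR rule described above can also be applied programmatically instead of with hand-picked limits. A sketch on made-up data shaped like ratsL_sum (values are invented for illustration):

```r
library(dplyr)

# Made-up group means shaped like ratsL_sum
df <- tibble::tibble(
  Group = factor(rep(1:2, each = 4)),
  Weight_mean = c(260, 265, 270, 200, 450, 455, 460, 590)
)

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] within each group
flagged <- df %>%
  group_by(Group) %>%
  mutate(outlier = Weight_mean < quantile(Weight_mean, 0.25) - 1.5 * IQR(Weight_mean) |
                   Weight_mean > quantile(Weight_mean, 0.75) + 1.5 * IQR(Weight_mean)) %>%
  ungroup()

filter(flagged, outlier)  # the light rat in group 1 and the heavy rat in group 2
```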

# define outlier from group 3
g3 <- ratsL_sum %>% filter(Group==3)
out3 <- min(g3$Weight_mean)

# Create a new data by filtering the outliers
ratsL_sum2 <- ratsL_sum %>%
  filter(250 < Weight_mean & Weight_mean < 560 & Weight_mean != out3)


# Draw a boxplot of the mean versus diet
ggplot(ratsL_sum2, aes(x = Group, y = Weight_mean, col=Group)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 8-64")

There is less within group variation after excluding the outliers and we may proceed with our analysis.

T-test or analysis of variance (ANOVA) or analysis of covariance (ANCOVA)

Based on our previous plots, there seems to be a difference in weights between the diet groups. We’ll now test whether this difference is statistically significant.

A t-test tests whether two groups differ statistically, while ANOVA tests whether three or more groups differ. Since we have three groups, we’ll use ANOVA for our analysis.

ANOVA assumes homogeneity of variance - the variance in groups 1-3 should be similar. We’ll test this with Levene’s test before proceeding further with ANOVA.

# Levene's test from 'car' package
library(car)
leveneTest(Weight_mean ~ Group, ratsL_sum2)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  0.3727  0.698
##       10

Levene’s test returned a non-significant p-value (0.698) for the difference of variances between groups, so we fail to reject its null hypothesis of equal variances between groups 1–3. We may proceed with ANOVA.

# Fit the linear model with the mean Weight as the response 
fit <- lm(Weight_mean ~ Group, data = ratsL_sum2)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: Weight_mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 176917   88458  2836.4 1.687e-14 ***
## Residuals 10    312      31                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This shows there seems to be a significant difference in mean weights between diet groups.
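A significant ANOVA only says that some group means differ, not which ones. A post-hoc test such as Tukey’s HSD (base R, applied to an aov fit) would show the pairwise differences; a sketch on made-up data with the same structure as our summary data:

```r
# Made-up mean weights with the same three-group structure as ratsL_sum2
set.seed(1)
df <- data.frame(
  Group = factor(rep(1:3, each = 4)),
  Weight_mean = c(rnorm(4, 260, 10), rnorm(4, 450, 10), rnorm(4, 540, 10))
)

# Pairwise group comparisons with family-wise 95% confidence intervals
fit <- aov(Weight_mean ~ Group, data = df)
TukeyHSD(fit)
```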

However, we should take baseline weight into account when assessing the differences in mean weights of groups 1–3, to adjust for the tracking phenomenon. We’ll add the baseline weight from the original rats dataset and run ANCOVA.

# Load original wide form rats data
rats_og <- as_tibble(read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t'))
rats_og$ID <- factor(rats_og$ID)
#rats_og$Group <- factor(rats_og$Group)

# Add the baseline from the original data as a new variable to the summary data
join_vars <- c("ID","WD1")
ratsL_sum3 <- ratsL_sum2 %>%
  left_join(rats_og[join_vars], by='ID') 
# Rename column
ratsL_sum3 <- ratsL_sum3 %>%
  rename('Weight_baseline' = 'WD1')

# Fit the linear model with the mean Weight as the response 
fit2 <- lm(Weight_mean ~ Weight_baseline + Group, data = ratsL_sum3)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit2)
## Analysis of Variance Table
## 
## Response: Weight_mean
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## Weight_baseline  1 174926  174926 6630.941 3.216e-14 ***
## Group            2   2065    1033   39.147 3.628e-05 ***
## Residuals        9    237      26                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Baseline weight is strongly associated with mean weight later in the study. However, diet (Group) keeps its statistically significant association with weight even after taking baseline weight into account. The result is in line with our graphical overview of the dataset after excluding the outliers.

2. Analysis from ch 9 of MABS - Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables

In part II of this assignment we’ll use the BPRS dataset, which includes psychiatric patient data: a brief psychiatric rating scale (BPRS) score measured before treatment and then weekly during 8 weeks of treatment. The patients (n = 40) were randomly assigned to treatment arm 1 or 2, and we are interested in whether the BPRS scores differ depending on the treatment received. A lower score means fewer symptoms.

We’ll first explore the data and then analyse it to determine whether the BPRS scores of the two treatment groups differ.

# read in the data
BPRSL <- read.csv('BPRS_long.csv', row.names=1)
head(BPRSL);dim(BPRSL)
##   treatment subject weeks bprs week
## 1         1       1 week0   42    0
## 2         1       2 week0   58    0
## 3         1       3 week0   54    0
## 4         1       4 week0   55    0
## 5         1       5 week0   72    0
## 6         1       6 week0   48    0
## [1] 360   5
View(BPRSL)

# Factor variables subject and treatment group
BPRSL$subject <- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)

# Glimpse the data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

From glimpsing and viewing the data we notice that the subjects are coded 1–20 within each treatment arm. However, subjects with the same code in different arms are different individuals, since each patient was randomized to either treatment 1 or treatment 2. To avoid mix-ups in our analysis we’ll create a new individual-specific factor ID variable.

# New ID-variable
BPRSL <- BPRSL %>%
  mutate(ID = as.factor(paste0(subject, "_t",treatment)))

# check data
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ID        <fct> 1_t1, 2_t1, 3_t1, 4_t1, 5_t1, 6_t1, 7_t1, 8_t1, 9_t1, 10_t1,…
summary(BPRSL)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252                                                 
##        ID     
##  1_t1   :  9  
##  1_t2   :  9  
##  10_t1  :  9  
##  10_t2  :  9  
##  11_t1  :  9  
##  11_t2  :  9  
##  (Other):306

Let’s plot!

# Plot the data
ggplot(BPRSL, aes(x = week, y = bprs, group = ID,  col=treatment)) +
  geom_line(aes(linetype=treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0,8,2)) +
  scale_y_continuous(name="BPRS") +
  theme(legend.position = "right")

BPRS seems to decrease during the treatment period in both treatment arms. There is no clear difference to be seen between the groups. We’ll divide the plot into two separate plots by treatment.
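The split by treatment mentioned above can be sketched with facet_grid(), which the later plots also use. A minimal, runnable version on mock data shaped like BPRSL (scores are made up):

```r
library(ggplot2)

# Mock data shaped like BPRSL (made-up scores)
mock <- data.frame(
  ID = rep(c("1_t1", "1_t2"), each = 3),
  treatment = factor(rep(1:2, each = 3)),
  week = rep(0:2, times = 2),
  bprs = c(42, 40, 38, 50, 47, 45)
)

# One panel per treatment arm
p <- ggplot(mock, aes(x = week, y = bprs, group = ID, col = treatment)) +
  geom_line(aes(linetype = treatment)) +
  facet_grid(. ~ treatment, labeller = label_both)
```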

Still, we cannot draw trustworthy conclusions about the effects of the two treatments on BPRS from these plots alone. Let’s proceed.

Linear Mixed Effects Models

Now things are getting serious. LET’S GO!

Linear model

When fitting a linear model to our data we assume independence of the observations (here BPRS measurements). Since the data is longitudinal we know they are not independent. We’ll ignore this knowledge for now and proceed with multiple regression modeling!

# create a regression model BPRSL_reg
BPRSL_reg <- lm(bprs ~ week + treatment, data=BPRSL)

# print out a summary of the model
summary(BPRSL_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

BPRS score is our target variable, and time (week) and treatment arm are our explanatory variables. Week has a statistically significant association with BPRS score, while the treatment arms do not seem to differ in BPRS scores. However, this analysis assumes independence of observations, which is highly unlikely to hold here, given that we are studying repeated measures of BPRS scores (longitudinal data).

We’ll now proceed to analyses that do not require the assumption of independence between observations. We’ll start with a random intercept model to assess whether there is a difference between the two treatment arms. Fitting a random intercept model allows the linear regression fit for each individual to differ in intercept from the other individuals. The model takes into account both the fixed effects and the random effects of our explanatory variables on our target variable.

The Random Intercept Model

# Create a random intercept model
BPRSL_ref <- lmer(bprs ~ week + treatment + (1 | ID), data = BPRSL, REML = FALSE)

# Print the summary of the model
BPRSL_ref
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | ID)
##    Data: BPRSL
##       AIC       BIC    logLik  deviance  df.resid 
##  2582.903  2602.334 -1286.452  2572.903       355 
## Random effects:
##  Groups   Name        Std.Dev.
##  ID       (Intercept) 9.869   
##  Residual             7.364   
## Number of obs: 360, groups:  ID, 40
## Fixed Effects:
## (Intercept)         week   treatment2  
##     46.4539      -2.2704       0.5722
summary(BPRSL_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | ID)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  ID, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000
confint(BPRSL_ref)
## Computing profile confidence intervals ...
##                 2.5 %    97.5 %
## .sig01       7.918218 12.645375
## .sigma       6.828332  7.973573
## (Intercept) 41.744598 51.163179
## week        -2.565919 -1.974914
## treatment2  -5.885196  7.029641

In our dataset the minimum BPRS score is 18 and the maximum 95. The between-individual (ID) variance is high (97.4; SD 9.9) relative to the residual variance (54.2). This confirms that there is considerable variation between the individual patients’ BPRS scores, i.e. their baseline symptom levels differ substantially.
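The variance components above also give the intraclass correlation (ICC): the share of the total variance attributable to between-patient differences. A quick check with the printed values:

```r
# Variance components from the random-effects table above
var_id    <- 97.39  # between-individual (ID) variance
var_resid <- 54.23  # residual variance

icc <- var_id / (var_id + var_resid)
round(icc, 2)  # about 0.64: roughly 64% of the remaining variation is between patients
```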

A one-unit change in time (one week passing) is associated with a decrease of about 2.3 BPRS points. Individuals on treatment 2 seem to have 0.57-point higher BPRS scores, but the confidence interval implies the treatment arms do not differ significantly (the 95% CI spans 0). These results are similar to the ones from our linear model, except that the standard error (SE) for time (week) is lower and the SE for treatment group is higher with this model. This again suggests that time is statistically significantly associated with BPRS score while treatment arm is not.

Let’s fit the random intercept and random slope model to our data:

Random Intercept and Random Slope Model

A nice and clear quote from our material: “Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope.” This allows us to account for the individual differences in the subjects’ symptom (BPRS) profiles as well as for the effect of time.

# create a random intercept and random slope model
BPRSL_ref1 <- lmer(bprs ~ week + treatment + (week | ID), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRSL_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | ID)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  ID       (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  ID, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## week         -2.2704     0.2713  -8.370
## treatment2    1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000
confint(BPRSL_ref1)
## Computing profile confidence intervals ...
##                  2.5 %     97.5 %
## .sig01      10.3102841 16.7099130
## .sig02      -0.8241816 -0.4128486
## .sig03       1.1562013  2.0277746
## .sigma       5.5925524  6.6009205
## (Intercept) 40.5849755 51.2468100
## week        -2.8150999 -1.7257334
## treatment2  -4.9313699  7.9592153
# perform an ANOVA test on the two models to assess formal differences between them
anova(BPRSL_ref1, BPRSL_ref)
## Data: BPRSL
## Models:
## BPRSL_ref: bprs ~ week + treatment + (1 | ID)
## BPRSL_ref1: bprs ~ week + treatment + (week | ID)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRSL_ref     5 2582.9 2602.3 -1286.5   2572.9                         
## BPRSL_ref1    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the random slope to the model increased the estimated inter-individual variance. The treatment group term remains non-significant for BPRS over time. The ANOVA test between our models shows that the model with the random slope is a better fit for these data.

Let’s take the interaction of our explanatory variables into account:

Random Intercept and Random Slope Model with interaction

We’ll now fit a random intercept and slope model that allows for a treatment × time interaction, and plot the result.

# create a random intercept and random slope model with the interaction
# week * treatment = all main effects and the interaction of time and treatment (= week + treatment + week:treatment)
# (week | ID) lets individuals vary randomly in their baseline BPRS (random intercept) and in the effect of time on their BPRS (random slope)
BPRSL_ref2 <- lmer(bprs ~ week + treatment + week * treatment + (week | ID), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRSL_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | ID)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  ID       (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  ID, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.9840  16.047
## week             -2.6283     0.3752  -7.006
## treatment2       -2.2911     4.2200  -0.543
## week:treatment2   0.7158     0.5306   1.349
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.668              
## treatment2  -0.707  0.473       
## wek:trtmnt2  0.473 -0.707 -0.668
confint(BPRSL_ref2)
## Computing profile confidence intervals ...
##                       2.5 %     97.5 %
## .sig01           10.2191216 16.4882560
## .sig02           -0.8181446 -0.4021022
## .sig03            1.1186279  1.9764684
## .sigma            5.5925518  6.6009204
## (Intercept)      41.8936922 53.8774190
## week             -3.3816817 -1.8749850
## treatment2      -10.7648856  6.1826634
## week:treatment2  -0.3495621  1.7812288
# perform an ANOVA test on the two models
anova(BPRSL_ref2, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref1: bprs ~ week + treatment + (week | ID)
## BPRSL_ref2: bprs ~ week + treatment + week * treatment + (week | ID)
##            npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref1    7 2523.2 2550.4 -1254.6   2509.2                    
## BPRSL_ref2    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

The results of our latest model (random intercept and slope with an interaction term) seem similar to those of our second model (random intercept and slope, no interaction term). The ANOVA between the two models confirms there is no significant difference between them for our data. In summary, adding the interaction term did not improve our model this time.

Next, we’ll plot the observed BPRS values and the fitted BPRS values.

# draw the plot of BPRSL with the observed BPRS values
ggplot(BPRSL, aes(x = week, y = bprs, group = ID, col=treatment)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Observed BPRS") +
  theme(legend.position = "right") +
  facet_grid(. ~ treatment, labeller=label_both)

# Create a vector of the fitted values
Fitted <- fitted(BPRSL_ref)
Fitted1 <- fitted(BPRSL_ref1)
Fitted2 <- fitted(BPRSL_ref2)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(bprs_fitted_values_BPRSL_ref = Fitted, bprs_fitted_values_BPRSL_ref1 = Fitted1, bprs_fitted_values_BPRSL_ref2 = Fitted2)
head(BPRSL)
##   treatment subject weeks bprs week   ID bprs_fitted_values_BPRSL_ref
## 1         1       1 week0   42    0 1_t1                     50.39349
## 2         1       2 week0   58    0 2_t1                     53.42798
## 3         1       3 week0   54    0 3_t1                     46.52190
## 4         1       4 week0   55    0 4_t1                     61.06652
## 5         1       5 week0   72    0 5_t1                     60.96188
## 6         1       6 week0   48    0 6_t1                     44.74306
##   bprs_fitted_values_BPRSL_ref1 bprs_fitted_values_BPRSL_ref2
## 1                      39.66964                      40.11147
## 2                      63.89902                      64.08993
## 3                      52.23190                      52.47526
## 4                      62.50789                      62.81131
## 5                      74.47847                      74.63778
## 6                      46.17363                      46.46693
# draw the plot of BPRSL with the Fitted values of bprs model 1
ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref, group = ID, col=treatment)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 4)) +
  scale_y_continuous(name = "Fitted BPRS (model 1: rnd intercept)") +
  theme(legend.position = "right") +
  facet_grid(. ~ treatment, labeller=label_both)

# draw the plot of BPRSL with the Fitted values of bprs model 2
ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref1, group = ID, col=treatment)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 4)) +
  scale_y_continuous(name = "Fitted BPRS (model 2: rnd intercept + slope)") +
  theme(legend.position = "right") +
  facet_grid(. ~ treatment, labeller=label_both)

# draw the plot of BPRSL with the Fitted values of bprs model 3
ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref2, group = ID, col=treatment)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 4)) +
  scale_y_continuous(name = "Fitted BPRS (model 3: rnd intercept + slope + interaction)") +
  theme(legend.position = "right") +
  facet_grid(. ~ treatment, labeller=label_both)

We can see how model 1 (random intercept only) differs from model 2 (random intercept and slope) and model 3 (random intercept and slope plus interaction). Model 1 does not account for individual differences in the rate of change of BPRS over time, as models 2 and 3 do. Models 2 and 3 yield very similar fitted values, in line with our earlier ANOVA result.
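To compare the three fits without flipping between separate plots, one option (a sketch, assuming `tidyr` is available alongside the `dplyr` and `ggplot2` calls already used above) is to pivot the three fitted-value columns into long form and draw them in a single faceted plot, with line type distinguishing the models:

```r
library(tidyr)

# Gather the three fitted-value columns into long form so each model
# becomes its own line type in one plot
BPRSL_long <- BPRSL %>%
  pivot_longer(
    cols = starts_with("bprs_fitted_values"),
    names_to = "model",
    names_prefix = "bprs_fitted_values_",
    values_to = "fitted"
  )

# One line per subject per model, faceted by treatment as before
ggplot(BPRSL_long, aes(x = week, y = fitted,
                       group = interaction(ID, model),
                       col = treatment, linetype = model)) +
  geom_line(alpha = 0.6) +
  scale_y_continuous(name = "Fitted BPRS (all three models)") +
  facet_grid(. ~ treatment, labeller = label_both)
```

In this view the straight parallel lines of model 1 stand apart, while the model 2 and model 3 lines nearly coincide for each subject.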

WOAH! DONE! ✓ Toughest modeling so far, but we made it! I loved the course and would redo it anytime (well... maybe after at least a two-week break). I learned so much, and more than I thought I could! HUZZAH!

Merry Christmas!

I have been Rlightened. ┌( ಠ_ಠ)┘